Data processor and method of processing data

ABSTRACT

A second decoder (114) of an instruction decode unit (119) decodes an operation code for a multiply-add operation, and a second operation unit (117) receives two data stored in a register file (115) to perform the multiply-add operation. In parallel with the operations of the second decoder (114) and the second operation unit (117), a first decoder (113) of the instruction decode unit (119) decodes an operation code for a 2-data load, and an operand access unit (104) causes two data (e.g., n bits each) stored in an internal data memory (105) to be transferred in parallel in the form of combined 2n-bit data to a first operation unit (116). Then, two predetermined registers of the register file (115) store the respective n-bit data from the first operation unit (116).

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a data processor for high-speed digital signal processing and a method of processing data for high-speed digital signal processing.

2. Description of the Background Art

Digital signal processors (DSPs) having an architecture suitable for signal processing have been used as data processors designed specifically for high-speed digital signal processing. These DSPs execute processing frequently used in signal processing, such as a multiply-add operation, at high speeds. An example of a DSP is the Motorola DSP56000. The DSP56000 includes two address pointers, two data memories, and a multiply-add operation unit. Parallel loading of data (e.g., the load of coefficients and data) from two 1-word memories specified respectively by the address pointers, updating of the two address pointers, and the execution of the combined multiply-add operation allow the multiply-add operation to be executed with a high throughput (see DSP56000 Digital Signal Processor Family Manual, 1992). In this manner, the DSP normally has two memories. Data are distributed to either of the memories. Some DSPs use a 2-port RAM for efficient data transfer.

An example of a microprocessor incorporating the DSP function is the Motorola CPU16. The CPU16 may repeatedly perform the multiply-add operation and a 2-word load in response to one RMAC instruction. However, the CPU16, in which one multiply-add operation requires 12 cycles, has difficulty achieving performance that competes with the DSPs (CPU16 Reference Manual, 1993).

In recent years, some microprocessors have been intended for implementing signal processing by means of software as the operating frequency improves. To improve the arithmetic performance, some of the microprocessors additionally provide multiply-add operation instructions and make the most of sophisticated parallel processing techniques such as superpipeline and superscalar to achieve DSP-level performance. For example, the PowerPC603 (Motorola and IBM) may execute a single-precision floating-point multiply-add operation with one clock cycle throughput by using 3-stage pipeline processing. This requires a large amount of hardware and significantly complicated control. To perform one multiply-add operation for each clock cycle, one clock cycle requires 2-word data. The PowerPC603 may load a maximum of one word for each clock cycle, resulting in an insufficient supply of operands (Proceedings of COMPCON 1994: “The PowerPC603 Microprocessor: A High Performance, Low Power, Superscalar RISC Microprocessor”; PowerPC603 RISC Microprocessor User's Manual, 1994).

The DSPs which must include two memories have a complicated memory construction and require very cumbersome data management for distribution of data between the two memories. The use of a 2-port RAM adds to the area and costs of the data processor. Additionally, the DSP is in general an accumulator machine and has difficulty executing complicated data processing.

The microprocessors which require one memory have a relatively simple memory construction. However, the microprocessors are not efficient in signal processing, unlike the DSPs wherein hardware directly represents the flow of signal processing. To achieve DSP-level performance, state-of-the-art microprocessors require an increased amount of hardware, adding to the costs of the data processor. Further, it is difficult to reduce the power consumption of the microprocessors because of the need for operation at high frequencies.

SUMMARY OF THE INVENTION

According to a first aspect of the present invention, a data processor comprises: a first memory portion for storing an instruction including a first operation code and a second operation code; a second memory portion for storing data; an instruction decode unit for receiving the instruction stored in the first memory portion, the instruction decode unit including first and second decoders for decoding the first and second operation codes in parallel, respectively; a register file portion including a plurality of registers for storing data to transfer data from and to the second memory portion; an operation unit for receiving first data stored in a first register of the register file portion to perform an arithmetic operation using the first data in response to a control signal, the control signal being the first operation code decoded by the first decoder of the instruction decode unit; and an operand access unit operated in parallel with the operation unit for causing second and third data stored in the second memory portion to be transferred in parallel and stored in second and third registers of the register file portion, respectively, in response to a control signal, the control signal being the second operation code decoded by the second decoder of the instruction decode unit.

Preferably, according to a second aspect of the present invention, the second and third data are each n bits (n is a natural number) in length, and the second and third data are combined together into 2n-bit data when the second and third data are transferred to the register file portion.

According to a third aspect of the present invention, a data processor comprises: a first memory portion for storing an instruction including a first operation code and a second operation code; a second memory portion for storing data; an instruction decode unit for receiving the instruction stored in the first memory portion, the instruction decode unit including first and second decoders for decoding the first and second operation codes in parallel, respectively; a register file portion including a plurality of registers for storing data to transfer data from and to the second memory portion; an operation unit for receiving first data stored in a first register of the register file portion to perform an arithmetic operation using the first data in response to a control signal, the control signal being the first operation code decoded by the first decoder of the instruction decode unit; and an operand access unit operated in parallel with the operation unit for causing second and third data stored respectively in second and third registers of the register file portion to be transferred in parallel and stored in the second memory portion in response to a control signal, the control signal being the second operation code decoded by the second decoder of the instruction decode unit.

Preferably, according to a fourth aspect of the present invention, the second and third data are each n bits (n is a natural number) in length, and the second and third data are combined together into 2n-bit data when the second and third data are transferred to the second memory portion.

Preferably, according to a fifth aspect of the present invention, the operation unit includes a multiplier for multiplying together the first data and fourth data stored in a fourth register of the register file portion, and an adder for adding at least two data together, the adder adding together the result of multiplication of the multiplier and data stored in a register of the register file portion to cause a register of the register file portion to store the result of addition.

Preferably, according to a sixth aspect of the present invention, the operation unit includes a multiplier for multiplying together the first data and fourth data stored in a fourth register of the register file portion, and an adder for adding at least two data together, the adder adding together the result of multiplication of the multiplier and data stored in a register of the register file portion to cause a register of the register file portion to store the result of addition.

Preferably, according to a seventh aspect of the present invention, the operation unit includes a multiplier for multiplying together the first data and fourth data stored in a fourth register of the register file portion, an adder for adding at least two data together, and an accumulator for holding a result of an operation, the adder adding together the result of multiplication of the multiplier and the data held in the accumulator to cause the accumulator to hold the result of addition.

Preferably, according to an eighth aspect of the present invention, the operation unit includes a multiplier for multiplying together the first data and fourth data stored in a fourth register of the register file portion, an adder for adding at least two data together, and an accumulator for holding a result of an operation, the adder adding together the result of multiplication of the multiplier and the data held in the accumulator to cause the accumulator to hold the result of addition.

According to a ninth aspect of the present invention, a data processor comprises: a memory portion for storing data; an instruction decode unit for receiving a first instruction including first and second operation codes and a second instruction including third and fourth operation codes and to be processed after the first instruction to decode the first and second operation codes and the third and fourth operation codes in parallel; a register file portion connected to the memory portion and including a plurality of registers each for storing data or an operand address; an operation unit for performing an arithmetic operation of the data stored in the register file portion; and a memory access portion operated in parallel with the operation unit for causing the operand address stored in the register file portion to be applied to the memory portion and for updating the operand address, wherein, in a first processing, the instruction decode unit receives the first instruction, and executed is parallel processing of (a) the operation unit to receive first data stored in a first register of the register file portion to perform an arithmetic operation in response to a control signal which is outputted from the instruction decode unit decoding the first operation code, and (b) the memory access portion to cause a first operand address stored in a second register of the register file portion to be applied to the memory portion to cause second data stored in the memory portion to be transferred to a third register of the register file portion in response to a control signal which is outputted from the instruction decode unit decoding the second operation code and to update the first operand address to write a second operand address into the second register in response to the control signal, and wherein, in a second processing, the instruction decode unit receives the second instruction, and executed is parallel processing of (c) the operation unit to receive the second data stored in the third register of the register file portion to perform an arithmetic operation in response to a control signal which is outputted from the instruction decode unit decoding the third operation code, and (d) the memory access portion to cause the second operand address stored in the second register of the register file portion to be applied to the memory portion to cause third data stored in the memory portion to be transferred to a fourth register of the register file portion in response to a control signal which is outputted from the instruction decode unit decoding the fourth operation code and to update the second operand address to write a third operand address into the second register in response to the control signal, the first processing and the second processing being executed by pipeline control.

A tenth aspect of the present invention is intended for a method of processing data by a data processor which includes a memory portion for storing data, a register file portion connected to the memory portion and including a plurality of registers each for storing data or an operand address, an operation unit for receiving the data stored in the register file portion to perform an arithmetic operation, and a memory access portion for causing the operand address stored in the register file portion to be applied to the memory portion. According to the present invention, the method comprises the steps of: (a) transferring first and second data stored in a first area of the memory portion in parallel to write the first and second data into first and second registers of the register file portion, respectively; (b) transferring third and fourth data stored in a second area of the memory portion in parallel to write the third and fourth data into third and fourth registers of the register file portion, respectively; (c) applying the first data stored in the first register and the third data stored in the third register to the operation unit to perform an arithmetic operation of the first and third data by the operation unit; and (d) applying the second data stored in the second register and the fourth data stored in the fourth register to the operation unit to perform an arithmetic operation of the second and fourth data by the operation unit.

Preferably, according to an eleventh aspect of the present invention, the method further comprises the steps of: (e) transferring fifth and sixth data stored in a third area of the memory portion in parallel to write the fifth and sixth data into fifth and sixth registers of the register file portion, respectively; and (f) transferring seventh and eighth data stored in a fourth area of the memory portion in parallel to write the seventh and eighth data into seventh and eighth registers of the register file portion, respectively, wherein one of the steps (c) and (d) is executed in parallel with at least one of the steps (e) and (f).

Preferably, according to a twelfth aspect of the present invention, the third area is the same as the first area, and the fourth area is the same as the second area.

Preferably, according to a thirteenth aspect of the present invention, the first and second data are each n bits (n is a natural number) in length, and the first and second data are combined together into 2n-bit data when the first and second data are transferred to the register file portion.

Preferably, according to a fourteenth aspect of the present invention, the step (c) comprises the steps of: multiplying the first and third data together; and adding data stored in a ninth register to the result of multiplication to store the result of addition as ninth data in the ninth register, and the step (d) comprises the steps of: multiplying the first and fourth data together; and adding the ninth data stored in the ninth register to the result of multiplication to store the result of addition in the ninth register.

In accordance with the first aspect of the present invention, the data processor comprises the instruction decode unit including the first and second decoders, the register file, the operation unit, and the operand access unit. The first and second operation codes are decoded and executed in parallel, and the arithmetic operation and the access of two data to the memory are executed in parallel, achieving high-speed data processing. In particular, a DSP-level signal processing performance of a microprocessor is implemented.

The simple construction may reduce the costs of the data processor.

The parallel processing of the multiply-add operation instruction and the access of two data to the memory allows one multiply-add operation to be performed per clock cycle.

In accordance with the data processor of the ninth aspect of the present invention, a plurality of instructions including the operation code for specifying the application of a memory operand to the register file while updating an address by using the register contents as the address, and the operation code for specifying the execution of the arithmetic operation with reference to the register value are processed by means of a pipeline processing technique. This permits the arithmetic operations to be executed without operand interference by means of software, improving the processing performance.

In accordance with the tenth aspect of the present invention, the method of processing data comprises loading the first and second data in parallel from the memory to the register, loading the third and fourth data in parallel from the memory to the register, performing the arithmetic operation of the first and third data, and performing the arithmetic operation of the second and fourth data. The access to the memory and the arithmetic operation are executed efficiently by using one memory, improving the performance of the data processor. In particular, digital signal processing performance is greatly improved under simple control.
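The four steps of the method can be sketched in software as follows. This is only an illustrative model, not the disclosed hardware: it assumes n = 16, that the first datum of each pair occupies the upper half of the combined 2n-bit word, that the data are signed, and that the arithmetic operation is a multiply-add. The names `pack_pair`, `load_pair`, `to_signed16` and `mac_two_pairs` are hypothetical.

```python
MASK16 = 0xFFFF  # assumption: n = 16 bits per datum


def pack_pair(first, second):
    """Combine two 16-bit data into one 32-bit word (first datum in the upper half; assumed layout)."""
    return ((first & MASK16) << 16) | (second & MASK16)


def load_pair(memory, index):
    """Model a 2-data load: one combined 32-bit transfer yields two 16-bit register values."""
    combined = memory[index]
    return (combined >> 16) & MASK16, combined & MASK16


def to_signed16(value):
    """Interpret a 16-bit pattern as a two's-complement integer (assumed signed data)."""
    return value - 0x10000 if value & 0x8000 else value


def mac_two_pairs(memory, area1, area2, acc=0):
    """Steps (a) to (d): two parallel loads followed by two multiply-add operations."""
    r0, r1 = load_pair(memory, area1)          # step (a): first and second data
    r2, r3 = load_pair(memory, area2)          # step (b): third and fourth data
    acc += to_signed16(r0) * to_signed16(r2)   # step (c): first and third data
    acc += to_signed16(r1) * to_signed16(r3)   # step (d): second and fourth data
    return acc
```

In this model the memory is a list of combined words, so a single index fetches both operands of a pair, which is what lets one memory feed one multiply-add per step.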

It is therefore an object of the present invention to provide an inexpensive high-performance microprocessor-type data processor which readily reduces power consumption under relatively simple control.

It is another object of the present invention to provide a data processor having DSP-level digital signal processing performance.

It is still another object of the present invention to provide a method of processing data which may achieve high-performance data processing control.

These and other objects, features, aspects and advantages of the present invention will become more apparent from the following detailed description of the present invention when taken in conjunction with the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates a set of registers for a data processor according to a first preferred embodiment of the present invention;

FIG. 2 illustrates a processor status word for the data processor according to the first preferred embodiment of the present invention;

FIG. 3 illustrates an instruction format for the data processor according to the first preferred embodiment of the present invention;

FIG. 4 illustrates a short format of a 2-operand instruction for the data processor according to the first preferred embodiment of the present invention;

FIG. 5 illustrates a short format of a branch instruction for the data processor according to the first preferred embodiment of the present invention;

FIG. 6 illustrates a long format of a 3-operand instruction or a load/store instruction for the data processor according to the first preferred embodiment of the present invention;

FIG. 7 illustrates a format of an instruction having an operation code in its left-hand container for the data processor according to the first preferred embodiment of the present invention;

FIG. 8 is a functional block diagram of the data processor according to the first preferred embodiment of the present invention;

FIG. 9 is a detailed block diagram of a first operation unit for the data processor according to the first preferred embodiment of the present invention;

FIG. 10 is a detailed block diagram of a PC unit for the data processor according to the first preferred embodiment of the present invention;

FIG. 11 is a detailed block diagram of a second operation unit for the data processor according to the first preferred embodiment of the present invention;

FIG. 12 illustrates pipeline processing for the data processor according to the first preferred embodiment of the present invention;

FIG. 13 illustrates a pipeline state when a load operand interference occurs in the data processor according to the first preferred embodiment of the present invention;

FIG. 14 illustrates a pipeline state when an arithmetic hardware interference occurs in the data processor according to the first preferred embodiment of the present invention;

FIG. 15 illustrates a program of a 256-tap FIR filter for the data processor according to the first preferred embodiment of the present invention;

FIG. 16 illustrates a bit pattern when a 2-word load instruction and a multiply-add operation instruction are executed in parallel in the data processor according to the first preferred embodiment of the present invention;

FIG. 17 illustrates the contents of an internal instruction memory corresponding to a loop part of the program of the FIR filter for the data processor according to the first preferred embodiment of the present invention;

FIG. 18 illustrates mapping of an internal data memory in relation to coefficients and data in the program of the FIR filter for the data processor according to the first preferred embodiment of the present invention;

FIG. 19 shows respective positions of FIGS. 19A to 19C;

FIGS. 19A to 19C illustrate a flow of processing in a loop of the program of the FIR filter for the data processor according to the first preferred embodiment of the present invention;

FIG. 20 illustrates signal lines of an n-stage secondary direct-form type-II IIR filter;

FIG. 21 illustrates a program of the IIR filter for the data processor according to the first preferred embodiment of the present invention;

FIG. 22 illustrates the contents of the internal instruction memory corresponding to a loop part of the program of the IIR filter for the data processor according to the first preferred embodiment of the present invention;

FIG. 23 illustrates mapping of the internal data memory in relation to coefficients and data in the program of the IIR filter for the data processor according to the first preferred embodiment of the present invention;

FIG. 24 shows respective positions of FIGS. 24A to 24C;

FIGS. 24A to 24C illustrate a flow of processing in a loop of the program of the IIR filter for the data processor according to the first preferred embodiment of the present invention;

FIG. 25 illustrates a loop part of a program of an IFFT for the data processor according to the first preferred embodiment of the present invention;

FIG. 26 illustrates a loop part of a program of a subtract-absolute-add operation for the data processor according to the first preferred embodiment of the present invention;

FIG. 27 shows respective positions of FIGS. 27A to 27C;

FIGS. 27A to 27C illustrate a flow of processing in the loop of the program of the subtract-absolute-add operation for the data processor according to the first preferred embodiment of the present invention;

FIG. 28 is a detailed block diagram of the second operation unit for the data processor according to a second preferred embodiment of the present invention;

FIG. 29 illustrates a program of a subtract-square-add operation for the data processor according to the second preferred embodiment of the present invention;

FIG. 30 illustrates the contents of the internal instruction memory corresponding to a loop part of the program of the subtract-square-add operation for the data processor according to the second preferred embodiment of the present invention;

FIG. 31 illustrates mapping of the internal data memory in relation to data in the program of the subtract-square-add operation for the data processor according to the second preferred embodiment of the present invention;

FIG. 32 shows respective positions of FIGS. 32A to 32C;

FIGS. 32A to 32C illustrate a flow of processing in the loop of the program of the subtract-square-add operation for the data processor according to the second preferred embodiment of the present invention;

FIG. 33 illustrates the contents of the internal instruction memory corresponding to the loop part of the program of the subtract-absolute-add operation for the data processor according to the second preferred embodiment of the present invention;

FIG. 34 shows respective positions of FIGS. 34A to 34C;

FIGS. 34A to 34C illustrate a flow of processing in the loop of the program of the subtract-absolute-add operation for the data processor according to the second preferred embodiment of the present invention;

FIG. 35 is a functional block diagram of the data processor according to a third preferred embodiment of the present invention;

FIG. 36 illustrates an instruction format for the data processor according to the third preferred embodiment of the present invention;

FIG. 37 is a block diagram of the second operation unit for the data processor according to a fourth preferred embodiment of the present invention;

FIG. 38 illustrates an instruction format for the data processor according to the fourth preferred embodiment of the present invention;

FIG. 39 illustrates a basic format of containers of an instruction for the data processor according to the fourth preferred embodiment of the present invention; and

FIG. 40 illustrates a loop part of the program of the FIR filter for the data processor according to the fourth preferred embodiment of the present invention.

DESCRIPTION OF THE PREFERRED EMBODIMENTS

First Preferred Embodiment

A data processor according to a first preferred embodiment of the present invention will be described below. The data processor of the first preferred embodiment is a 16-bit processor whose addresses and data are 16 bits in length.

FIG. 1 illustrates a set of registers for the data processor of this preferred embodiment. The data processor employs big endian bit and byte ordering wherein the most significant bit is the bit 0.

Sixteen general-purpose registers R0 to R15 are provided for storing data and address values therein. The general-purpose registers R0 to R14 are designated by the numerals 1 to 15 in FIG. 1, respectively. The general-purpose register R13 (designated at 14 in FIG. 1) is allocated as a link (LINK) register for storing a return address for a subroutine jump. The general-purpose register R15 is a register for a stack pointer (SP) including an interruption stack pointer (SPI) 16 and a user stack pointer (SPU) 17. A processor status word (PSW) to be described later switches between the interruption stack pointer (SPI) 16 and the user stack pointer (SPU) 17. The SPI 16 and the SPU 17 are generically referred to as an SP hereinafter. The number of each of the registers is specified in a 4-bit register specification field unless otherwise specified. The data processor of this preferred embodiment includes an instruction for processing a pair of registers, e.g. R0 and R1. In this case, the pair of registers is specified in such a manner that an even-numbered register is specified, thereby indirectly specifying the corresponding odd-numbered register whose number equals the even number plus one.
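The register-pair convention can be illustrated with a short sketch. The function name `register_pair` is hypothetical, and rejecting an odd register number is an assumption made for the example; the text does not say how the hardware treats that case.

```python
def register_pair(even_reg):
    """Return the (even, odd) register numbers implied by a 4-bit even register number."""
    if not 0 <= even_reg <= 15:
        raise ValueError("register number must fit in the 4-bit field")
    if even_reg % 2 != 0:
        # Assumption: an odd number is treated as invalid in this sketch.
        raise ValueError("a register pair is specified by its even-numbered register")
    return even_reg, even_reg + 1
```

For example, specifying R0 selects the pair R0 and R1, and specifying R14 selects R14 and R15.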

The reference numerals 21 to 29 designate 16-bit control registers. The number of each of the control registers is represented by 4 bits, similar to those of the general-purpose registers. The control register CR0 designated at 21 is a register for the processor status word (PSW) including a bit for specifying the operating mode of the data processor and a flag indicative of the result of operations. FIG. 2 illustrates the construction of the PSW 21. The reference numeral 41 designates an SM bit (bit 0) indicative of a stack mode for specifying the corresponding relationship when the general-purpose register R15 is specified as above described. The SM bit 41 indicates an interruption mode when it is “0”. Then, the SPI is used as the general-purpose register R15. The SM bit 41 indicates a user mode when it is “1”. Then, the SPU is used as the general-purpose register R15. The reference numeral 42 designates an IE bit (bit 5) for specifying an interruption enable state. When the IE bit is “0”, the interruption is masked (ignored if asserted). When the IE bit is “1”, the interruption is accepted. A repeat function for achieving zero-overhead loop processing is implemented in the data processor of this preferred embodiment. The reference numeral 43 designates an RP bit (bit 6) indicative of a repeat state. The RP bit indicates no repeat being executed when it is “0”. The RP bit indicates a repeat being executed when it is “1”. A modulo addressing function, which is addressing for accessing a circular buffer, is implemented in the data processor of this preferred embodiment. The reference numeral 44 designates an MD bit (bit 7) for specifying a modulo enable state. When the MD bit is “0”, the modulo addressing is disabled. When the MD bit is “1”, the modulo addressing is enabled. The reference numeral 45 designates an execution control flag (bit 12) to which the result of a comparison instruction or the like is set. The reference numeral 46 designates a carry flag (bit 15) to which a carry is set when addition and subtraction instructions are executed.
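Because bit 0 is the most significant bit, the PSW bit numbers above map to masks counted from the top of the 16-bit word. The following sketch shows that mapping; the labels `EXEC` and `CARRY` and the function names are hypothetical, chosen only for this illustration.

```python
# Bit numbers as given for the PSW, using big-endian bit numbering (bit 0 = MSB).
PSW_BITS = {
    "SM": 0,      # stack mode: 0 = interruption mode (SPI), 1 = user mode (SPU)
    "IE": 5,      # interruption enable
    "RP": 6,      # repeat being executed
    "MD": 7,      # modulo addressing enable
    "EXEC": 12,   # execution control flag (label is illustrative)
    "CARRY": 15,  # carry flag (label is illustrative)
}


def psw_bit(psw, bit_number):
    """Extract one bit of a 16-bit PSW, where bit 0 is the most significant bit."""
    return (psw >> (15 - bit_number)) & 1


def decode_psw(psw):
    """Return a dictionary of the named PSW fields."""
    return {name: psw_bit(psw, n) for name, n in PSW_BITS.items()}
```

With this numbering, the SM bit corresponds to the mask 0x8000 and the carry flag to the mask 0x0001.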

The control register CR2 designated at 23 in FIG. 1 is a register for a program counter (PC) indicative of the instruction address being executed. The length of the instruction processed by the data processor of this preferred embodiment is basically fixed at 32 bits. The program counter 23 holds a word address wherein 32 bits make up one word.

The control register CR1 designated at 22 in FIG. 1 is a register for a backup processor status word (BPSW), and the control register CR3 designated at 24 in FIG. 1 is a register for a backup program counter (BPC). The control registers CR1 and CR3 are registers for saving and holding the values of the PSW 21 and PC 23 being executed if an exception or an interruption is detected, respectively.

The control registers 25 to 27 are repeat-associated registers which allow a user to read and write the values thereof so that an interruption is accepted during a repeat. The control register CR7 designated at 25 in FIG. 1 is a register for a repeat counter (RPT_C) for holding the count value indicative of the subsequent repeat count. The control register CR8 designated at 26 in FIG. 1 is a register for a repeat start address (RPT_S) for holding the first instruction address in the block to be repeated. The control register CR9 designated at 27 in FIG. 1 is a register for a repeat end address (RPT_E) for holding the last instruction address in the block to be repeated.

The control registers 28 and 29 are provided to execute modulo addressing. The control register CR10 designated at 28 in FIG. 1 holds a modulo start address (MOD_S), and the control register CR11 designated at 29 in FIG. 1 holds a modulo end address (MOD_E). The control registers CR10 and CR11 hold the first and last word (16-bit) addresses, respectively. When the modulo addressing is used during an increment, the lower address is set to the MOD_S 28, and the higher address is set to the MOD_E 29. If the initial value held in the register to be incremented coincides with the address held in the MOD_E 29, the value held in the MOD_S 28 is written back to the register as an incremented result.
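The wrap-around rule for an incrementing pointer can be sketched as follows. This is a simplified model under stated assumptions: the increment step is taken to be one address unit, addresses wrap at 16 bits, and the function name `modulo_increment` and its parameters are hypothetical.

```python
def modulo_increment(addr, mod_s, mod_e, md_enabled=True, step=1):
    """Post-increment an address register, wrapping from MOD_E back to MOD_S.

    Assumptions for this sketch: the step is one address unit and addresses are 16 bits wide.
    """
    if md_enabled and addr == mod_e:
        # The value before the increment equals MOD_E, so MOD_S is written back.
        return mod_s
    return (addr + step) & 0xFFFF
```

Walking a pointer through a four-entry circular buffer with MOD_S = 0x0100 and MOD_E = 0x0103 visits 0x0102, 0x0103, 0x0100, 0x0101 in turn; with the MD bit cleared, the pointer simply continues to 0x0104.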

The reference numerals 31 and 32 designate 40-bit accumulators A0 and A1 for holding the result of a multiply-add operation in an integer format. The accumulators A0 and A1 designated at 31 and 32 in FIG. 1 comprise areas A0H (31b) and A1H (32b) for holding the high-order 16 bits of the result of the multiply-add operation, areas A0L (31c) and A1L (32c) for holding the low-order 16 bits of the result of the multiply-add operation, and 8-bit guard areas A0G (31a) and A1G (32a) for holding bits that overflow beyond the high-order bits of the result of the multiply-add operation, respectively.
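The three accumulator areas can be modeled by splitting a 40-bit value into an 8-bit guard field, a 16-bit high field and a 16-bit low field. The sketch below assumes the guard bits sit above the high-order half, as the description of overflow bits suggests; the function names are hypothetical.

```python
ACC_MASK = (1 << 40) - 1  # 40-bit accumulator


def split_accumulator(acc):
    """Split a 40-bit accumulator value into (guard, high, low) fields of 8, 16 and 16 bits."""
    acc &= ACC_MASK
    return (acc >> 32) & 0xFF, (acc >> 16) & 0xFFFF, acc & 0xFFFF


def join_accumulator(guard, high, low):
    """Rebuild the 40-bit accumulator value from its three fields."""
    return ((guard & 0xFF) << 32) | ((high & 0xFFFF) << 16) | (low & 0xFFFF)
```

The guard field lets a sum of many 32-bit products grow past 32 bits without losing the overflowed bits.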

The data processor of the first preferred embodiment processes a 2-way VLIW (very long instruction word) instruction set. FIG. 3 illustrates an instruction format for the data processor of the first preferred embodiment. The length of the instruction is basically fixed at 32 bits, and the instruction is aligned on a 4-byte (32-bit) boundary. Each 32-bit instruction code comprises 2 format specification bits (FM bits) 51 indicative of the format of the instruction, a 15-bit left-hand container 52, and a 15-bit right-hand container 53. Each of the containers 52 and 53 may store therein a 15-bit short-format sub-instruction. Further, the containers 52 and 53 together may store therein a 30-bit long-format sub-instruction. For purposes of simplification, the short-format sub-instruction and long-format sub-instruction are referred to hereinafter as a short instruction and a long instruction, respectively.

The FM bits 51 may specify the format of the instruction and the order of two short instructions to be executed. If the FM bits 51 are “11”, the FM bits 51 indicate that the containers 52 and 53 hold the long instruction. If they are not “11”, the FM bits 51 indicate that each of the containers 52 and 53 holds the short instruction. If the indication is that two short instructions are held, the FM bits 51 specify the order of execution. If the FM bits 51 are “00”, the FM bits 51 indicate that two short instructions are executed in parallel. If they are “01”, the FM bits 51 indicate that the short instruction held in the right-hand container 53 is executed after the short instruction held in the left-hand container 52 is executed. If they are “10”, the FM bits 51 indicate that the short instruction held in the left-hand container 52 is executed after the short instruction held in the right-hand container 53 is executed. In this manner, the first preferred embodiment allows encoding into one 32-bit instruction including two short instructions to be executed sequentially, improving encoding efficiency.
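The interpretation of the FM bits may be sketched, for illustration only, as follows. The bit positions used here (FM bits in the two most significant bits, followed by the left-hand and then the right-hand container) are an assumption of this sketch, as are the function name and the returned labels.

```python
def decode_fm(word):
    """Return (format, sub-instructions in execution order) for a 32-bit code."""
    fm = (word >> 30) & 0x3        # assumed position of the FM bits
    left = (word >> 15) & 0x7FFF   # 15-bit left-hand container
    right = word & 0x7FFF          # 15-bit right-hand container
    if fm == 0b11:
        # Both containers together hold one 30-bit long instruction.
        return ("long", [(left << 15) | right])
    if fm == 0b00:
        return ("parallel", [left, right])
    if fm == 0b01:
        return ("sequential", [left, right])   # left first, then right
    return ("sequential", [right, left])       # "10": right first, then left
```

For instance, a word whose FM bits are “10” yields the right-hand short instruction first, followed by the left-hand short instruction.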

FIGS. 4 to 7 illustrate typical instruction encodings. FIG. 4 shows the instruction encoding of a short instruction having two operands. Fields 61 and 64 are operation code fields. The field 64 specifies an accumulator number in some cases. Fields 62 and 63 specify the positions to hold the operand value by using a register number or an accumulator number. The field 63 specifies a 4-bit short immediate value in some cases. FIG. 5 shows the instruction encoding of a short-format branch instruction including an operation code field 71 and an 8-bit branch displacement field 72. The branch displacement is specified by a word (32 bits) offset, like the PC value. FIG. 6 shows a format of a 3-operand instruction having a 16-bit displacement or immediate value or a load/store instruction which includes an operation code field 81, fields 82 and 83 for specifying a register number like the short format, and an extended data field 84 for specifying the 16-bit displacement or immediate value. FIG. 7 shows a format of an instruction having an operation code in its right-hand container 53 wherein a 2-bit field 91 indicates “01”. The reference numerals 93 and 96 designate operation code fields, and 94 and 95 designate fields for specifying a register number or the like. The reference numeral 92 designates reserved bits used for the operation code or register number as required.

Further, there are provided some operations having special instruction encodings, for example, an instruction wherein all 15 bits constitute an operation code such as a NOP (no operation) instruction, and a one-operand instruction.

Sub-instructions for the data processor of this preferred embodiment are a RISC-like instruction set. Only the load/store instruction accesses the memory data, and the operation instruction performs an arithmetic operation on an operand in the register/accumulator or using an immediate operand. There are five operand data addressing modes: a register indirect mode, a register indirect mode with post-increment, a register indirect mode with post-decrement, a push mode, and a register relative indirect mode, whose mnemonics are “@Rsrc”, “@Rsrc+”, “@Rsrc−”, “@-SP”, and “@(disp16, Rsrc)”, respectively, where Rsrc is a register number for specifying a base address, and disp16 is the 16-bit displacement value. The address of the operand is specified by a byte address.

All of the modes except the register relative indirect mode have the instruction format shown in FIG. 4. The field 63 specifies a base register number, and the field 62 specifies the number of a register into which a value loaded from the memory is written or the number of a register for holding the value to be stored. In the register indirect mode, the value of the register specified as the base register serves as the operand address. In the register indirect mode with post-increment, the value of the register specified as the base register serves as the operand address, and the value in the base register is post-incremented by the size (the number of bytes) of the operand and written back. In the register indirect mode with post-decrement, the value of the register specified as the base register serves as the operand address, and the value in the base register is post-decremented by the size (the number of bytes) of the operand and written back. The push mode is usable only when the store instruction is provided and the base register is the register R15. In the push mode, the stack pointer (SP) value pre-decremented by the size (the number of bytes) of the operand serves as the operand address, and the decremented value is written back to the SP.

The register relative indirect mode has the instruction format shown in FIG. 6. The field 83 specifies a base register number, and the field 82 specifies the number of a register into which the value loaded from the memory is written or the number of a register for holding the value to be stored. The field 84 specifies a displacement value for the position at which the operand is stored from the base address. In the register relative indirect mode, the 16-bit displacement value added to the value in the register specified as the base register serves as the operand address.
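The five operand addressing modes may be summarized, as a sketch only, by a function returning the operand address together with the value written back to the base register. The function name, the use of ASCII mnemonic strings as mode selectors, the default operand size, and the 16-bit address wrap are assumptions of this sketch; modulo addressing is not modelled.

```python
MASK16 = 0xFFFF  # assumed 16-bit address width

def effective_address(mode, base, size=2, disp16=0):
    """Return (operand_address, new_base_value) for one memory access."""
    if mode == "@Rsrc":              # register indirect
        return base, base
    if mode == "@Rsrc+":             # post-increment by the operand size
        return base, (base + size) & MASK16
    if mode == "@Rsrc-":             # post-decrement by the operand size
        return base, (base - size) & MASK16
    if mode == "@-SP":               # push: pre-decrement, then access
        addr = (base - size) & MASK16
        return addr, addr
    if mode == "@(disp16, Rsrc)":    # register relative indirect
        return (base + disp16) & MASK16, base
    raise ValueError("unknown addressing mode: " + mode)
```

Note that only the push mode uses the updated value as the operand address; the post-increment and post-decrement modes access memory at the original base value.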

The post-increment type register indirect mode and the post-decrement type register indirect mode may use a modulo addressing mode by setting the MD bit 44 in the PSW 21 to “1”.

Jump target addressing of a jump instruction includes register indirect addressing for specifying the jump target address by using a register value, and PC relative indirect addressing for specifying the jump target address by using a branch displacement of the jump instruction from the PC. PC relative indirect addressing includes short format addressing for specifying the branch displacement by using 8 bits, and long format addressing for specifying the branch displacement by using 16 bits. Further, the data processor has a repeat instruction which achieves loop processing without overhead.

FIG. 8 is a functional block diagram of a data processor 100 according to the first preferred embodiment of the present invention. The data processor 100 comprises an MPU core 101, an instruction fetch unit 102 for accessing instruction data in response to a request from the MPU core 101, an internal instruction memory 103, an operand access unit 104 for accessing operand data in response to a request from the MPU core 101, an internal data memory 105, and an external bus interface unit 106 for arbitrating external memory requests from the instruction fetch unit 102 and operand access unit 104, where the external memory requests access a memory external to the data processor 100.

The MPU core 101 includes an instruction queue 111, a control unit 112, a register file 115, a first operation unit 116, a second operation unit 117, and a PC unit 118.

The instruction queue 111 holds 2 entries of 32-bit instruction codes and a valid bit, and is controlled in a FIFO (first-in first-out) order. The instruction queue 111 temporarily holds instruction data fetched by the instruction fetch unit 102 to transmit the instruction data to the control unit 112.

The control unit 112 performs all control of the MPU core 101, for example, control of the instruction queue 111, pipeline control, execution of instructions, and interface with the instruction fetch unit 102 and operand access unit 104. The control unit 112 includes an instruction decode unit 119 for decoding an instruction code transmitted from the instruction queue 111. The instruction decode unit 119 includes two decoders, that is, first and second decoders 113 and 114. The first decoder 113 decodes the instruction to be executed in the first operation unit 116, and the second decoder 114 decodes the instruction to be executed in the second operation unit 117. In a first cycle of decoding of a 32-bit instruction, the first decoder 113 analyzes an instruction code in the left-hand container 52, and the second decoder 114 analyzes an instruction code in the right-hand container 53. The data in the FM bits 51, and the bit 0 and bit 1 of the left-hand container 52 are analyzed by both of the first and second decoders 113 and 114. The data in the right-hand container 53 is sent to the first decoder 113 to extract the extended data but is not analyzed. Thus, the instruction to be executed first must be located in a position corresponding to an operation unit for executing the instruction. If two short instructions are executed in sequential order, the instruction to be executed later is transmitted to both of the first and second decoders 113 and 114, and an executable decode result becomes valid. If an instruction is executable by both of the first and second decoders 113 and 114, only the decoded result in the first decoder 113 is validated, and the decoded result in the second decoder 114 is invalidated.

The register file 115 includes the registers 1 to 17 and is connected to the first operation unit 116, second operation unit 117, and PC unit 118 by a plurality of buses.

FIG. 9 is a detailed block diagram of the first operation unit 116. The first operation unit 116 is connected to the register file 115 by an S1 bus 301, an S2 bus 302, and an S3 bus 303 to read data from the registers through the three buses 301, 302 and 303. The S2 bus 302 is connected to only odd-numbered registers, and the S1 bus 301 and S2 bus 302 together may transmit 2-word data from the pair of registers in parallel. The first operation unit 116 is also connected to the register file 115 through a D1 bus 311, a W1 bus 314, and a W2 bus 315 to write data into the registers through the three buses 311, 314 and 315. The W1 bus 314 is connected to only the even-numbered registers, and the W2 bus 315 is connected to only the odd-numbered registers. The W1 bus 314 and W2 bus 315 together may transmit 2-word data to the pair of registers in parallel.

An AA latch 151 and an AB latch 152 are input latches for an ALU 153. The AA latch 151 receives a register value read through the S1 bus 301 or S3 bus 303 and has a zero clear function. The AB latch 152 receives a register value read through the S3 bus 303 or a 16-bit immediate value generated by decoding in the first decoder 113, and has a zero clear function. The ALU 153 mainly performs transfer, comparison, arithmetic and logic operations, calculation/transfer of operand addresses, increment/decrement of the base address values of the operand addresses, and calculation/transfer of the jump target addresses. The results of operations and address modifications are written back to the register specified by the instruction in the register file 115 through a selector 155 and the D1 bus 311. An AO latch 154 is a latch for holding operand addresses, and selectively holds and outputs the result of address calculation in the ALU 153 or the base address value held in the AA latch 151 to the operand access unit 104 through an OA bus 321. For calculation or transfer of the jump target address, the output from the ALU 153 is transferred to the PC unit 118 through a JA bus 323.

An MOD_S 156 and an MOD_E 157 are control registers corresponding to the control registers CR10 (28) and CR11 (29) of FIG. 1, respectively. A comparator 158 compares the value in the MOD_E 157 with the base address value on the S3 bus 303. With the modulo addressing enabled in the register indirect mode with post-increment/decrement, the value in the MOD_S 156 which is held in the latch 159 is written back to the base address register in the register file 115 through the selector 155 and the D1 bus 311.

A store data (SD) register 160 includes two 16-bit registers and temporarily holds store data outputted to the S1 bus 301 or to both of the S1 bus 301 and S2 bus 302. Data held in the SD register 160 is transferred to an alignment circuit 162 through a latch 161. The alignment circuit 162 aligns the data into 32-bit form in accordance with the operand address to output the data to the operand access unit 104 through a latch 163 and an OD bus 322.

The data loaded by the operand access unit 104 is applied to a load data (LD) register 164 including two 16-bit registers through the OD bus 322. The value in the LD register 164 is transferred to an alignment circuit 166 through a latch 165. The alignment circuit 166 aligns the data to output data to be transferred to the even-numbered registers to the W1 bus 314 and data to be transferred to the odd-numbered registers to the W2 bus 315. When 1-word data are loaded, load data are outputted to one of the W1 bus 314 and the W2 bus 315. When 2-word data are loaded, the load data are outputted to both of the W1 bus 314 and the W2 bus 315. The outputted data are written into the specified register in the register file 115.

A PSW 171 in the control unit 112 is a register for holding the value in the control register CR0 (21) of FIG. 1. A PSW updating unit 172, including a latch, updates the value in the PSW 171 in response to the result of an operation or by the execution of an instruction. To transfer a value to the PSW 171, only assigned bits are transferred from the AB latch 152 to the control unit 112. To read a value from the PSW 171, the value is outputted from the PSW updating unit 172 to the D1 bus 311 and is written to the register file 115. A BPSW 167 is a register corresponding to the control register CR1 (22) of FIG. 1. During exception processing, the value in the PSW 21 outputted to the D1 bus 311 is written to the BPSW 167. The value in the BPSW 167 is read to the S3 bus 303 and transferred to the PSW 171 or the register file 115.

FIG. 10 is a detailed block diagram of the PC unit 118. An instruction address (IA) register 181 holds the address of the next instruction to be fetched and outputs the address of the next instruction to the instruction fetch unit 102. When a subsequent instruction is to be fetched, the address value transferred from the IA register 181 through a latch 182 is incremented by 1 in an incrementor 183 and then written back to the IA register 181. If the sequence is changed by a jump or repeat, the IA register 181 receives the jump target address transferred by the JA bus 323.

RPT_S 184, RPT_E 186, and RPT_C 188 are repeat control registers and correspond to the control registers CR8 (26), CR9 (27), and CR7 (25) in the register set of FIG. 1, respectively. RPT_E 186 holds the address of the last instruction in the block to be repeated. The last address is calculated in the first operation unit 116 during repeat instruction processing and applied to RPT_E 186 through the JA bus 323. A comparator 187 compares the value of the end address in the repeat block held in RPT_E 186 with the value of a fetch address held in the IA register 181. If the value in RPT_C 188 for holding a repeat count is not “1” during repeat processing and the two addresses coincide with each other, a start address of the block to be repeated which is held in RPT_S 184 is transferred to the IA register 181 through a latch 185 and the JA bus 323. Each time the instruction at the last address of the block to be repeated is executed, the value in RPT_C 188 is decremented by 1 by a decrementor 190 through a latch 189. If the result of decrement equals zero, the RP bit 43 in the PSW 21 is cleared and the repeat processing is terminated. RPT_S 184, RPT_E 186, and RPT_C 188 have an input port from the D1 bus 311 and an output port to the S3 bus 303. By using these buses, initialization caused by repeat instruction processing, and saving and returning operations are performed.
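The repeat control described above may be modelled, in a deliberately simplified form, as follows. This sketch merges the fetch and execution of each instruction into a single step, treats addresses as word addresses incremented by 1, and uses a hypothetical `stop` address to end the trace; none of these simplifications is taken from the embodiment.

```python
def run_repeat(rpt_s, rpt_e, rpt_c, stop):
    """Return the sequence of instruction addresses until `stop` is reached."""
    trace = []
    pc = rpt_s
    repeating = True                   # models the RP bit in the PSW
    while pc != stop:
        trace.append(pc)
        if repeating and pc == rpt_e:
            jump_back = rpt_c != 1     # comparator hit and count is not "1"
            rpt_c -= 1                 # decrement at the last instruction
            if rpt_c == 0:
                repeating = False      # RP bit cleared, repeat terminated
            pc = rpt_s if jump_back else pc + 1
        else:
            pc += 1
    return trace
```

With a block spanning addresses 10 to 12 and a repeat count of 3, the block is traversed three times and execution then falls through to address 13 with no branch instruction in the loop body.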

An execution stage PC (EPC) 194 holds the PC value of the instruction being executed, and a next instruction PC (NPC) 191 calculates the PC value of the instruction to be executed next. If a jump occurs during execution, the NPC 191 receives the value on the JA bus 323 to which the jump target address is transferred. If a branch occurs during a repeat, the NPC 191 receives the first address in the block to be repeated from the latch 185. In other cases, the value in the NPC 191 transferred through a latch 192 is incremented by an incrementor 193 and then written back to the NPC 191. In the case of a subroutine jump instruction, the value in the latch 192 is outputted as a return address to the D1 bus 311 and then written back to the register R13 defined as the link register in the register file 115. If the next instruction is to be executed, the value in the latch 192 is transferred to the EPC 194. If the PC value of the instruction being executed is to be referred to, the value in the EPC 194 is outputted to the S3 bus 303 and transferred to the first operation unit 116. A BPC 196 corresponds to the control register CR3 (23) in the register set of FIG. 1. If an exception or interruption is detected, the value in the EPC 194 is transferred to the BPC 196 through a latch 195. The BPC 196 has an input port from the D1 bus 311 and an output port to the S3 bus 303, and transfer to/from the register file 115 is performed.

FIG. 11 is a detailed block diagram of the second operation unit 117. The second operation unit 117 is connected to the register file 115 by an S4 bus 304 and an S5 bus 305 to read data from the registers through the two buses 304 and 305. The S4 bus 304 and S5 bus 305 together may transfer 2-word data from the pair of registers in parallel. The second operation unit 117 is also connected to the register file 115 by a D2 bus 312 and a D3 bus 313 to write data into the registers through the two buses 312 and 313. The D2 bus 312 is connected to all registers, but the D3 bus 313 is connected to only the odd-numbered registers. The D2 bus 312 and D3 bus 313 together may transfer 2-word data to the pair of registers in parallel.

Accumulators 208 correspond to the two 40-bit accumulators A0 and A1 designated as 31 and 32 in FIG. 1.

The reference numeral 201 designates a 40-bit ALU including a guard bit adder for the accumulator which is 8 bits long (bit 0 to bit 7), an arithmetic and logic unit which is 16 bits long (bit 8 to bit 23), and an adder for adding the low-order 16 bits of the accumulator which is 16 bits long (bit 24 to bit 39). The ALU 201 performs addition and subtraction of up to 40 bits and a logic operation of 16 bits.

An A latch 202 and a B latch 203 are input latches for the ALU 201. The A latch 202 receives the data on the S4 bus 304 at the bit 8 to bit 23 positions, receives the value in the accumulator 208 unchanged through a shifter 204, or receives the value in the accumulator 208 arithmetically right-shifted by 16 bits through the shifter 204. A shifter 205 receives the value in the accumulator 208 through an interconnecting line 206 (8 guard bits), the S4 bus 304 (high-order 16 bits) and the S5 bus 305 (low-order 16 bits), or receives the value in the register subjected to sign extension into 40 bits through only the S5 bus 305 or through the S4 and S5 buses 304 and 305. Then, the shifter 205 arithmetically shifts the value by any amount ranging from a 3-bit left shift to a 1-bit right shift. The B latch 203 receives the data on the S5 bus 305 at the bit 8 to bit 23 positions, or receives the output from a multiplier 211 through the P latch 214 or the output from the shifter 205. The A latch 202 and the B latch 203 have the function to clear the data therein to zero and to set the data therein to a constant value.

If a destination operand indicates the accumulator 208, the output from the ALU 201 is written into the accumulator 208 through a selector 207. If the destination operand indicates the register, the output from the ALU 201 is written into the register file 115 through the selector 207 and either the D2 bus 312 only (1-word data) or both of the D2 bus 312 and D3 bus 313 (2-word data). A saturation circuit 209 receives the output from the ALU 201 and has the function of clipping its output to a maximum or minimum value expressible as 16 bits or 32 bits with reference to the guard bits to output data containing high-order 16 bits or both high-order and low-order 32 bits. The output from the saturation circuit 209 may be written into the register file 115 through only the D2 bus 312 (1-word data) or through both of the D2 bus 312 and D3 bus 313 (2-word data). For calculation of absolute values and execution of maximum and minimum value setting instructions, the outputs of the A latch 202 and the B latch 203 are connected to the input of the selector 207.
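As an illustration of the clipping function, the 32-bit case may be sketched as follows. The function name is hypothetical, and the sketch assumes that the 40-bit ALU output is a two's complement value whose upper 8 bits are the guard bits; the 16-bit case is not modelled here.

```python
def saturate32(acc40):
    """Clip a 40-bit two's complement value to the signed 32-bit range."""
    acc40 &= (1 << 40) - 1
    if acc40 & (1 << 39):            # negative: sign bit lies in the guard bits
        acc40 -= 1 << 40
    max32 = (1 << 31) - 1
    min32 = -(1 << 31)
    # If the guard bits carry information beyond a sign extension,
    # the value is outside the 32-bit range and is clipped.
    return max(min32, min(max32, acc40))
```

A value that fits in 32 bits passes through unchanged, whereas a value that has overflowed into the guard bits is replaced by the nearest representable limit.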

A priority encoder (PENC) 210 receives the value in the B latch 203. The PENC 210 generates the shift count value required to normalize the input data as fixed point format, and outputs the results to the register file 115 through the D2 bus 312.
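One plausible reading of the normalization shift count is the number of redundant sign bits below the sign bit, that is, the left shift after which the bit following the sign bit differs from the sign bit. The following sketch implements that reading only; the function name, the default 16-bit width, and the result returned for inputs consisting entirely of sign bits are assumptions and are not stated in the embodiment.

```python
def norm_shift_count(x, width=16):
    """Count the bits below the sign bit that merely repeat the sign bit."""
    x &= (1 << width) - 1
    sign = (x >> (width - 1)) & 1
    count = 0
    for bit in range(width - 2, -1, -1):   # scan downward from below the sign
        if ((x >> bit) & 1) != sign:
            break
        count += 1
    return count
```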

An X latch 212 and a Y latch 213 receive 16-bit values on the S4 bus 304 and S5 bus 305, respectively, and have the function of zero extension or sign extension of the 16-bit values to 17 bits.

The multiplier 211 is a 17-bit×17-bit multiplier for multiplying the value stored in the X latch 212 by the value stored in the Y latch 213. If the multiplier 211 receives a multiply-add instruction or a multiply-subtract instruction, the result of multiplication is applied to a P latch 214 and transmitted to the B latch 203. If the multiplier 211 receives a multiply instruction and the destination operand is the accumulator 208, the result of multiplication is written into the accumulator 208 through the selector 207.
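Extending each 16-bit operand to 17 bits allows a single signed multiplier to handle both signed and unsigned operands. A minimal sketch of this idea follows; the function names and the `signed` flags are hypothetical, and Python integers stand in for the 17-bit latch contents.

```python
def extend17(value16, signed):
    """Model sign or zero extension of a 16-bit value to a 17-bit signed value."""
    value16 &= 0xFFFF
    if signed and value16 & 0x8000:
        return value16 - 0x10000   # sign extension: negative 17-bit value
    return value16                 # zero extension: non-negative value

def multiply(x16, y16, x_signed=True, y_signed=True):
    """Multiply two 16-bit operands after extension to 17 bits."""
    return extend17(x16, x_signed) * extend17(y16, y_signed)
```

For example, the bit pattern 0xFFFF multiplied by 2 yields -2 when treated as signed and 131070 when treated as unsigned.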

A barrel shifter 215 may perform an up-to-16-bit left and right arithmetic/logic shift on 40-bit or 16-bit data. A shift data (SD) latch 217 receives as shift data the value in the accumulator 208 or the value in the register applied through the S4 bus 304. A shift count (SC) latch 216 receives as a shift count the immediate value or the register value through the S5 bus 305. The barrel shifter 215 performs a shift specified by the operation code on the data in the SD latch 217 by the shift count specified by the SC latch 216. The result of the shift operation is written back to the accumulator 208 or to the register file through the D2 bus 312. The shifter 215 has a 2-word transfer function. Specifically, the shifter 215 outputs the 2-word data received through the S4 bus 304 and S5 bus 305 to the D2 bus 312 and D3 bus 313 through the SD latch 217 and shifter 215 to write back the 2-word data into the register file 115. The shifter 215 may also perform 1-word transfer.

An immediate value latch 218 extends a 6-bit immediate value generated by the second decoder 114 into a 16-bit value and then holds the 16-bit value to transfer the 16-bit value to an arithmetic unit through the S5 bus 305.

Pipeline processing in the data processor will be described below according to the first preferred embodiment of the present invention. FIG. 12 illustrates the pipeline processing. The data processor of the first preferred embodiment performs 5-stage pipeline processing including an instruction fetch (IF) stage 401 for fetching instruction data, an instruction decode (D) stage 402 for analyzing instructions, an instruction execution (E) stage 403 for executing operations, a memory access (M) stage 404 for accessing a data memory, and a write back (W) stage 405 for writing operands loaded from a memory into a register. For multiply-add/multiply-subtract operations, a further 2-stage pipeline including multiplication and addition is used to execute instructions. The latter stage processing is referred to as an instruction execution 2 (E2) stage 406.

At the IF stage 401, a fetch of instructions, management of the instruction queue 111, and repeat control are mainly performed. The IF stage 401 controls the operations of the instruction fetch unit 102, the internal instruction memory 103, the external bus interface unit 106, the instruction queue 111, the IA register 181, latch 182, incrementor 183 and comparator 187 in the PC unit 118, and units for performing the stage control, the instruction fetch control, control of the PC unit 118 and control of the instruction queue 111 in the control unit 112. The IF stage 401 is initialized by a jump at the E stage 403.

The fetch address is held in the IA register 181. If a jump occurs at the E stage 403, the IA register 181 receives the jump target address through the JA bus 323 to perform initialization. When the instruction data are fetched sequentially, the incrementor 183 increments the address. The sequence switching control is performed if the comparator 187 detects a coincidence between the value in the IA register 181 and the value in the RPT_E 186 during repeat processing and the value in the RPT_C 188 is not “1”. Then, the value held in the RPT_S 184 is transferred to the IA register 181 through the latch 185 and JA bus 323.

The value in the IA register 181 is sent to the instruction fetch unit 102 which in turn fetches the instruction data. If the corresponding instruction data are stored in the internal instruction memory 103, an instruction code is read from the internal instruction memory 103. In this case, the instruction fetch is completed within one clock cycle. If the corresponding instruction data are not stored in the internal instruction memory 103, an instruction fetch request is sent to the external bus interface unit 106. The external bus interface unit 106 arbitrates between the instruction fetch request and a request from the operand access unit 104. When the external bus interface unit 106 accepts the instruction fetch request from the instruction fetch unit 102, the external bus interface unit 106 reads out the instruction data from an external memory, and transmits the fetched instruction to the instruction fetch unit 102. The external bus interface unit 106 requires a minimum of 2 clock cycles to access the external memory. The instruction fetch unit 102 transfers the received instruction to the instruction queue 111. The instruction queue 111 is a 2-entry queue and outputs the instruction code received under FIFO control to the instruction decoders 113 and 114.

At the D stage 402, the instruction decode unit 119 analyzes the operation code and generates execution control signals to control the first operation unit 116, the second operation unit 117, and the PC unit 118 to execute instructions. The D stage 402 is initialized by a jump at the E stage 403. If the instruction code sent from the instruction queue 111 is invalid, the D stage 402 is placed in an idle cycle and waits for a valid instruction code to be received. If the E stage 403 is not permitted to start the next processing, the execution control signals are invalidated, and the D stage 402 waits for the termination of processing of the preceding instruction at the E stage 403. For example, such a condition occurs when the instruction being executed at the E stage 403 is a memory access instruction and the preceding memory access is not terminated at the M stage 404.

The D stage 402 also performs division of two instructions to be sequentially executed, sequence control of a 2-cycle execution instruction, a conflict check on a load operand using a scoreboard register (not shown), and a conflict check on an operation unit in the second operation unit 117. If any of these conflicts is detected, the output of the control signal is inhibited until the conflict is cancelled. FIG. 13 illustrates an example of the load operand conflict. If immediately after a load instruction a multiply-add operation refers to an operand to be loaded by the load instruction, the start of execution of the multiply-add operation instruction is inhibited until the load to the register is completed. In this case, a 2-clock-cycle stall occurs if the memory access is terminated within one clock cycle. FIG. 14 illustrates an example of a hardware resource conflict in the second operation unit 117. If a rounding instruction which uses an adder is immediately after a multiply-add operation instruction, the start of execution of the rounding instruction is inhibited until the operation of the preceding instruction is terminated. In this case, a 1-clock-cycle stall occurs. No stalls occur if the multiply-add operation instructions are executed successively.
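The 2-clock-cycle figure for the load operand conflict can be reproduced with a small timing model. The model assumes one clock cycle per stage except for the M stage, a consumer instruction issued one cycle after the load, and a consumer E stage that may start only in the cycle after the W stage of the load; these assumptions and the function name are introduced for this sketch, which is chosen to agree with the stall count stated above.

```python
def load_use_stall(memory_cycles=1):
    """Return the stall cycles for an instruction that uses a just-loaded operand."""
    # Load: IF at cycle 0, D at 1, E at 2, M from cycle 3, W after M completes.
    load_w = 3 + memory_cycles
    # Consumer: IF at cycle 1, D at 2, E at 3 if nothing holds it back.
    consumer_e_unstalled = 3
    # Assumed rule: E of the consumer starts no earlier than the cycle after W.
    consumer_e = max(consumer_e_unstalled, load_w + 1)
    return consumer_e - consumer_e_unstalled
```

With a one-cycle memory access the model gives a stall of two cycles; under the same assumptions each additional memory cycle lengthens the stall by one cycle.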

The first decoder 113 mainly generates operation control signals for control of the first operation unit 116, control of parts of the PC unit 118 which are not controlled by the IF stage 401, read control of the register file 115 to the S1 bus 301, S2 bus 302, and S3 bus 303, and write control thereof from the D1 bus 311. The first decoder 113 also generates the instruction-dependent information to be used in the M stage 404 and the W stage 405, and this control information is sent through the pipeline. The second decoder 114 mainly generates execution control signals for the second operation unit 117.

The E stage 403 performs processing of almost all instruction executions except the memory access and addition of the multiply-add/multiply-subtract operation instructions, such as an arithmetic operation, comparison, data transfer between registers including control registers, operand address calculation of the load/store instructions, calculation of the jump target address of the jump instruction, jump processing, EIT (exception, interruption, trap) detection, and jump to an EIT vector table.

With interrupts enabled, an interrupt is always detected at the end of a 32-bit instruction. No interrupt is serviced between two short instructions to be sequentially executed in the 32-bit instruction.

The completion of the execution of the E stage 403 must stall when the instruction being processed at the E stage 403 is an operand access instruction and a preceding memory access at the M stage 404 has not completed. Stage control is performed in the control unit 112.

At the E stage 403, the first operation unit 116 performs arithmetic and logic operations, comparison, and transfer. The ALU 153 calculates the address of the memory operand including modulo control and a branch target address. The value in the register specified as the operand by an instruction is transferred to the first operation unit 116. Extended data, such as an immediate or displacement value, is also transferred to the first operation unit 116 from the first decoder 113 if necessary. Arithmetic and logical operations are performed by the ALU 153, and an operation result is written back to the register file 115 through the D1 bus 311. If the load/store instruction is provided, the result of the arithmetic operation is transmitted to the operand access unit 104 through the AO latch 154 and OA bus 321. If the jump instruction is provided, the jump target address is transmitted to the respective units through the JA bus 323. The store data are read from the register file 115 through the S1 bus 301 and S2 bus 302 and are held and aligned. Then, the store data are transferred to the operand access unit 104 through the OD bus 322. The PC unit 118 manages the PC value of the instruction being executed and calculates the next instruction address. Data transfer between the control register (except the accumulator) and the register file 115 is carried out by both of the first operation unit 116 and the PC unit 118.

At the E stage 403, the second operation unit 117 executes all operations except addition of the multiply-add operation, such as arithmetic and logic operations, comparison, transfer, and shift. The value of an operand is transferred from the register file 115, immediate value latch 218, and accumulator 208 to respective operation units through the S4 bus 304, S5 bus 305 and other exclusive paths, and is subjected to a specified operation. The result of the operation is written back to the accumulator 208, or to the register file 115 through the D2 bus 312 and the D3 bus 313.

The control signal generated in the second decoder 114 for execution of the addition and subtraction of the multiply-add/multiply-subtract operation is held under control of the E stage 403.

In the M stage 404, operand memory access is performed according to the address sent from the first operation unit 116. The operand access unit 104 reads/writes data from/to the internal data memory 105 or an on-chip IO (not shown) in one clock cycle if the operand is in the internal data memory 105 or the on-chip IO. The operand access unit 104 outputs a data access request to the external bus interface unit 106 if the operand is not in the internal data memory 105 or the on-chip IO (not shown). The external bus interface unit 106 accesses data in the external memory, and transfers the read data to the operand access unit 104 if a load instruction is executed. The external bus interface unit 106 requires a minimum of two clock cycles to access the external memory. If the load instruction is executed, the operand access unit 104 transfers the read data to the LD register 164 through the OD bus 322. The M stage 404 control is performed in the control unit 112.

At the W stage 405, alignment of loaded operands, zero/sign extension of byte data, and writing to the register file 115 are performed.

At the E2 stage 406, the ALU 201 executes the addition and subtraction of the multiply-add/multiply-subtract operation.

The data processor of this preferred embodiment uses, as an internal clock signal for internal control, a clock signal generated by multiplying an input clock signal by four. Each of the pipeline stages requires a minimum of one internal clock cycle to terminate processing thereof. The details of the clock control are not directly related to the present invention and hence are not described.

An example of processing of the respective sub-instructions is discussed below. The processing of operation instructions such as addition, subtraction, logic operation, and comparison, and register-to-register transfer instructions is terminated in three stages: the IF stage 401, the D stage 402, and the E stage 403. The operations and data transfer are executed at the E stage 403.

The multiply-add/multiply-subtract instruction requires 2 clock cycles for execution of multiplication at the E stage 403 and addition and subtraction at the E2 stage 406, that is, substantially 4-stage processing.

The load instruction requires five stages: the IF stage 401, the D stage 402, the E stage 403, the M stage 404, and the W stage 405 to terminate the processing. The store instruction requires four stages: the IF stage 401, the D stage 402, the E stage 403, and the M stage 404 to terminate the processing.

An instruction which requires 2 cycles for execution directs that the first and second instruction decoders 113 and 114 perform the processing in two cycles. Each of the first and second instruction decoders 113 and 114 outputs an execution control signal for each cycle, and executes the operation in two cycles.

One long instruction performs the above described processing. Two instructions to be executed in parallel perform the above described processing in accordance with the instruction that takes a greater number of clock cycles to execute the instruction in the E stage 403. For example, a combination of the instruction to be executed in two cycles and the instruction to be executed in one cycle requires two cycles. Two short instructions to be executed sequentially are decoded sequentially in the D stage 402 and executed sequentially in the E stage 403. For example, two addition instructions to be terminated at the E stage 403 are divided into respective instruction processes at the D stage 402 and executed over 2 cycles at the E stage 403.

An example of processing is described below on the basis of some programs.

FIG. 15 illustrates an exemplary program of a 256-tap FIR (finite impulse response) filter (frame processing) of the data processor according to the first preferred embodiment. The symbol “| |” in FIG. 15 indicates that two short instructions are executed in parallel. The FIR filter executes the following calculation: $\sum_{i=0}^{255} \left( A[i] * D[i] \right)$

where A[i] is a coefficient array and D[i] is a data array. This calculation includes 256 multiply-add operations. The coefficient and the data each are 16 bits in length.
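
The calculation above can be checked against a direct software reference. The following Python sketch is illustrative only: the names `to_s16` and `fir_frame` are hypothetical, and the sketch models just the arithmetic of the sum with signed 16-bit inputs, not the pipeline or the instruction set of the data processor.

```python
def to_s16(x):
    """Interpret the low 16 bits of x as a signed 16-bit value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def fir_frame(coeffs, data):
    """Return the sum of A[i] * D[i] over one frame of equal-length arrays."""
    if len(coeffs) != len(data):
        raise ValueError("coefficient and data arrays must have equal length")
    acc = 0
    for a, d in zip(coeffs, data):
        acc += to_s16(a) * to_s16(d)  # one multiply-add per tap
    return acc
```

For example, `fir_frame([1] * 256, list(range(256)))` returns 32640, the sum of the integers 0 to 255.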

In FIG. 15, initialization is designated at 501, loop processing at 502, and post-processing at 503. The loop processing without overhead is implemented by a repeat (repi) instruction. A block of 6 instructions between the instruction next to the repi instruction and the instruction specified by the label “loopend” is executed 42 times. The repi instruction is a long instruction including an operation code, a 16-bit displacement for specifying the last address of the repeat block in the PC relative mode, and an 8-bit immediate value for specifying the repeat count, and requires two clock cycles for execution. In the first cycle, the instruction address next to the repi instruction is transferred from the latch 192 to RPT_S 184 and the latch 185 through the D1 bus 311. The address of the repi instruction is transferred from the EPC 194 through the S3 bus 303 to the AA latch 151, and the displacement value specified by the instruction is applied from the first decoder 113 to the AB latch 152. The ALU 153 adds the data in the AA latch 151 and AB latch 152 together to transfer the result of the addition, which is the last instruction address of the block to be repeated, to RPT_E 186 through the JA bus 323. In the second cycle, the 8-bit immediate value which is zero-extended into 16 bits is applied from the first decoder 113 to the AB latch 152 and is then transferred to RPT_C 188 through the ALU 153 and D1 bus 311. The RP bit 43 in the PSW 21 is set to “1”. In this manner, initialization required for repeat processing is terminated. The registers R0 to R5 are used as a buffer for data; the registers R6 to R11 are used as a buffer for coefficients; the register R12 is used as a data pointer; and the register R14 is used as a pointer for coefficients.

The processing in the loop is described in detail hereinafter. Each instruction includes the load instruction and the multiply-add operation instruction, and the two short instructions are executed in parallel. In FIG. 15, the “LD2W Rdest, @Rsrc+” indicates that 2-word (32-bit) data are fetched using the contents of the register specified by Rsrc as an operand address, and the fetched operand value is written to a pair of registers specified by Rdest (e.g., a pair of registers R0 and R1 when Rdest indicates R0). The value of Rsrc plus 4 (byte size of the operand) is written back. “MAC Adest, Rsrc1, Rsrc2” indicates the multiply-add operation instruction. The value in the register specified by Rsrc1 and the value in the register specified by Rsrc2 are multiplied together as signed values, and the result of multiplication is added to the value in the accumulator specified by Adest. The result of the addition is written back to the accumulator. FIG. 16 illustrates a bit pattern when these two instructions are executed in parallel. These instructions are allocated as instructions corresponding to the bit allocation of the short instruction having two operands of FIG. 4. The reference numeral 521 designates FM bits which are “00” since two instructions are executed in parallel. The reference numerals 522 and 525 designate operation codes of an LD2W instruction with post-increment, and 526 designates an operation code of a MAC instruction. The reference numerals 523, 524, 527, and 528 designate areas for specifying the register numbers of Rdest, Rsrc, Rsrc1, and Rsrc2 for holding operands, respectively. Rdest may specify only even-numbered registers. The reference numeral 529 designates an area Ad for specifying the accumulator number of Adest. FIG. 17 illustrates the contents of the internal instruction memory corresponding to the loop part. For simplicity, the contents of the memory are expressed as mnemonics. The 32-bit instructions are referred to as I1 (511), I2 (512) and the like, and the short instructions are referred to as I1a (511(a)), I1b (511(b)) and the like. The six instructions I1 (511) to I6 (516) are repeatedly executed 42 times in the loop.

FIG. 18 illustrates mapping of the internal data memory with respect to the coefficients A[i] and data D[i]. Each area of the internal data memory holds 256 data entries. For the coefficients A[i], 16-bit data are held in 256-word (512-byte) areas at the addresses 2000 to 21ff in hexadecimal. For the data D[i], 16-bit data are held in 256-word (512-byte) areas at the addresses 2400 to 25ff in hexadecimal.

FIG. 19 (FIGS. 19A to 19C) illustrates a flow of processing in the loop wherein the pipeline stages are depicted as the abscissa with time as the ordinate. All instructions are stored in the internal instruction memory 103, and all operand data are stored in the internal data memory 105. One clock cycle is required to complete the processing at one stage. When the repeat processing continues, T1 follows T6.

An example is given below based on the processing of I1 (511). At the IF stage 401, I1 (511) is fetched during T1 period (531). The contents of the IA register 181 are transferred to the instruction fetch unit 102. The comparator 187 compares the value in the IA register 181 with the value in the RPT_E 186. Since a mismatch occurs, the value in the IA register 181 is incremented by the incrementor 182 and written back. The instruction fetch unit 102 accesses the internal instruction memory 103 to transmit the read 32-bit instruction data to the instruction queue 111. The instruction queue 111 transmits the instruction data transferred thereto within the same cycle to the first decoder 113 and the second decoder 114.

At the D stage 402, I1 (511) is decoded during T2 period (537). The first decoder 113 decodes the LD2W instruction of I1a (511(a)) to produce the control signal, and the second decoder 114 decodes the MAC instruction of I1b (511(b)) to produce the control signal. The control signals from the first and second decoders 113 and 114 are outputted to the first and second operation units 116 and 117, respectively. The immediate value which is “4” is transmitted to the first operation unit 116. At the D stage 402, a conflict check is performed on the operands and arithmetic units, but no interference occurs. The value of the read operand of the MAC instruction has already been loaded. The MAC instruction of I1b (511(b)) is executed during T3 period (543). Writing a value to the register R0 has been completed during T1 period (535), and writing a value to the register R6 has been completed during T2 period (540).

At the E stage 403, I1 (511) is executed during T3 period (543). The first operation unit 116 produces the operand address of the LD2W instruction of I1a (511(a)) and updates the value of the address pointer. The value in the register R12 of the register file 115 which is the operand address is transferred through the S3 bus 303 to the AA latch 151. The value in the AA latch 151 is outputted unchanged to the operand access unit 104 through the AO latch 154 and OA bus 321. The immediate value outputted from the first decoder 113 is transferred to the AB latch 152. The ALU 153 performs addition to write back the result of addition to the register R12 of the register file 115 through the selector 155 and D1 bus 311. The second operation unit 117 performs multiplication of the MAC instruction of I1b (511(b)). The value in the register R0 of the register file 115 is applied to the X register 212 through the S4 bus 304, and the value in the register R6 of the register file 115 is applied to the Y register 213 through the S5 bus 305. Both of the values are handled as signed values and multiplied together. The result of multiplication is applied to the P register 214.

At the M and E2 stages 404 and 406, I1 (511) is subjected to memory access and addition, respectively, during T4 period (549). The operand access unit 104 loads the 4-byte data from the internal data memory 105 at the address transmitted from the first operation unit 116 to transfer the fetched value to the LD register 164 through the OD bus 322. The second operation unit 117 performs addition of the MAC instruction of I1b (511(b)). The value in the accumulator 208 (A0) is not shifted by the shifter 204 but is applied to the A latch 202. The value in the P register 214 is subjected to sign extension into 40 bits and applied to the B latch 203. The ALU 201 adds the values in the A and B latches 202 and 203 together to write back the result of addition to the accumulator 208 (A0).
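
The two-step MAC processing described above (multiplication at the E stage, addition at the E2 stage) can be sketched in Python as follows. This is a minimal model under stated assumptions: the function names are hypothetical, the accumulator is taken to be 40 bits wide on the basis of the 40-bit sign extension mentioned above, and wrap-around on accumulator overflow is an assumption, since the overflow behavior of the addition is not described here.

```python
ACC_BITS = 40  # assumed accumulator width (from the 40-bit sign extension)


def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def mac_step(acc, rsrc1, rsrc2):
    """Model 'MAC Adest, Rsrc1, Rsrc2' on 16-bit register values."""
    # E stage: signed 16-bit x 16-bit multiplication (result held like the P register)
    product = sign_extend(rsrc1, 16) * sign_extend(rsrc2, 16)
    # E2 stage: the product is added to the accumulator; wrap-around is assumed
    return sign_extend(acc + product, ACC_BITS)
```

For instance, `mac_step(10, 0xFFFF, 3)` treats 0xFFFF as the signed value -1 and returns 7.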

At the W stage 405, writing back to the register is performed during T5 period (555). The first operation unit 116 outputs the data held in the LD register 164 to the W1 bus 314 and W2 bus 315 through the latch 161 and alignment circuit 166. Since the operand is aligned in 4 bytes, the high-order 2 bytes are outputted to the W1 bus 314, and the low-order 2 bytes are outputted to the W2 bus 315. The data on the W1 bus 314 are written into the register R4 of the register file 115, and the data on the W2 bus 315 are written into the register R5 of the register file 115.
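
The effect of the 2-word load with post-increment on the register file can be summarized by the following Python sketch. It is a simplified model, not the hardware implementation: the function name `ld2w` is hypothetical, the memory is modeled as a byte sequence in which the high-order 2 bytes are assumed to sit at the lower address, and the pointer register is assumed to be 16 bits wide.

```python
def ld2w(regs, memory, rdest, rsrc):
    """Model 'LD2W Rdest, @Rsrc+' on a list of 16-bit registers."""
    if rdest % 2:
        raise ValueError("Rdest may specify only even-numbered registers")
    addr = regs[rsrc]
    word = memory[addr:addr + 4]  # 4-byte aligned operand
    # High-order 2 bytes (W1 bus) go to Rdest, low-order 2 bytes (W2 bus) to Rdest+1.
    regs[rdest] = int.from_bytes(word[0:2], "big")      # byte order is an assumption
    regs[rdest + 1] = int.from_bytes(word[2:4], "big")
    regs[rsrc] = (addr + 4) & 0xFFFF  # post-increment by the operand size
```

A single call thus fills one register pair and advances the pointer, which is what allows one 2-word load to feed two subsequent multiply-add operations.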

The pipeline processing of one 32-bit instruction for each clock cycle achieves one multiply-add operation for each clock cycle. In this manner, the loaded data value is written two words at a time into the registers R0 and R1 of the register file 115 during T1 period 535 at the W stage 405, and the loaded coefficient value is written two words at a time into the registers R6 and R7 of the register file 115 during T2 period 540 at the W stage 405. The value in the register R0 of the register file 115 for holding the data and the value in the register R6 of the register file 115 for holding the corresponding coefficient are referred to and multiplied together during T3 period 543 at the E stage 403. The value in the register R1 of the register file 115 for holding the data and the value in the register R7 of the register file 115 for holding the corresponding coefficient are referred to and multiplied together during T4 period 548 at the E stage 403. Such pipeline processing by means of software without operand interference improves efficiency.

The repeat processing is not illustrated in detail in FIG. 19, but is described below briefly. At the IF stage 401, the fetch address is compared with the last address of the repeat block. A match occurs during T6 period 556 at the IF stage 401. The counter value of RPT_C 188 which is not “1” indicates further repetition of the processing, and the processing sequence is changed. The value in the latch 185 which is the start address of the block to be repeated is transferred to the IA register 181 through the JA bus 323. The instruction at the start address of the block to be repeated is fetched during T7 (not shown) period corresponding to T1 period 531 at the IF stage 401. The last instruction address match detection result of the repeat block is transferred through the pipeline. During T2 period 538 at the E stage 403 wherein I6 is executed, the decrementor 190 decrements the value in the repeat counter RPT_C 188 by one independently of the instruction to be executed. If the value in RPT_C 188 before decrement is “1”, the RP bit in the PSW is cleared to zero, and the repeat is disabled.
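
The repeat control described above can be approximated by the following Python sketch. It is a behavioral simplification under stated assumptions: the class and method names are hypothetical, the address step of 4 per 32-bit instruction is assumed, and the counter decrement, which the text places at the E stage, is folded into the fetch step for brevity.

```python
class RepeatControl:
    """Simplified model of RPT_S, RPT_E, RPT_C and the RP bit."""

    def __init__(self, start, end, count):
        self.rpt_s = start  # start address of the block to be repeated
        self.rpt_e = end    # last instruction address of the block
        self.rpt_c = count  # remaining repeat count
        self.rp = True      # RP bit: repeat enabled

    def next_fetch(self, ia, step=4):
        """Return the fetch address that follows address ia."""
        if not (self.rp and ia == self.rpt_e):
            return ia + step          # mismatch: plain increment
        before = self.rpt_c
        self.rpt_c -= 1               # decrement on each pass through the last address
        if before == 1:
            self.rp = False           # last pass: repeat is disabled
            return ia + step
        return self.rpt_s             # further repetition: back to the block start
```

With a block of six instructions and a count of 3, this model fetches the block three times and then falls through to the instruction after the block.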

An n-stage second-order direct-form type II (n biquad) IIR (infinite impulse response) filter is described below. FIG. 20 illustrates the signal flow graph of the filter. In FIG. 20, 601 represents the multiplication of coefficients, 602 represents the addition of input data, and 603 represents a unit delay. Five multiply-add operations are executed within the loop. In this case, it is necessary to write the data of last two operations into the memory during the loop period. The data and coefficient each are 16 bits in length. FIG. 21 illustrates an example of a program of the IIR filter for the data processor of the first preferred embodiment. Initialization is performed at 606, the loop processing at 607, and the post-processing at 608. The loop processing without overhead is implemented by the repeat instruction.
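
As a rough illustration of one filter stage with five multiplications and two delayed values, the following Python sketch implements a generic second-order direct-form II section and a cascade of such sections. The coefficient names (`g`, `a1`, `a2`, `b1`, `b2`), their sign convention, and the use of plain Python numbers instead of 16-bit fixed-point data are assumptions made for illustration; the exact arrangement of FIG. 20 is not reproduced.

```python
def biquad_df2(x, state, coef):
    """One direct-form II section; state = [w1, w2] holds the two delayed values."""
    g, a1, a2, b1, b2 = coef
    w1, w2 = state
    w0 = g * x + a1 * w1 + a2 * w2   # three multiplications (input gain and feedback)
    y = w0 + b1 * w1 + b2 * w2       # two multiplications (feed-forward)
    state[0], state[1] = w0, w1      # the two data updated per stage
    return y


def cascade(x, states, coefs):
    """Pass one input sample through n sections; each output feeds the next stage."""
    for state, coef in zip(states, coefs):
        x = biquad_df2(x, state, coef)
    return x
```

With `coef = (1, 0, 0, 1, 0)` the section computes y[n] = x[n] + x[n-1], so the input sequence 1, 0, 0 yields 1, 1, 0.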

Processing in the loop is described in detail herein. FIG. 22 illustrates the contents of the internal instruction memory corresponding to the loop part. Seven instructions I1 (611) to I7 (617) are executed repeatedly 42 times within the loop. An ST2W instruction at 611(a) directs that the memory stores the values in the two registers R0 and R1. The value in the register R12 serving as the base address is post-incremented. An MULX instruction at 611(b) directs that the values in the two registers R0 and R6 as signed values are multiplied together and the result of multiplication is written back into the accumulator. The result of multiplication is written back into the accumulator 208 within the same cycle through the selector 207. An ADD instruction at 612(a) directs that the ALU 201 adds 4 to the contents of the register R12. An MV instruction at 613(b) directs that the value of R4 is copied to R1. An MV2W instruction at 616(b) transfers two words, that is, transfers the values in the registers R4 and R5 to the registers R2 and R3, respectively. Since the data are transferred through the barrel shifter 215 in this case, a hardware resource does not interfere with the immediately preceding multiply-add operation instruction. An RACHI instruction at 617(b) directs that the value in the accumulator A0 is 1 bit left-shifted, rounded to upper 16 bits, limited on the basis of the value of the guard bits to a maximum value h'7fff expressible in 16 bits if an overflow occurs and to a minimum value h'8000 expressible in 16 bits if an underflow occurs, and written back to the register R0. This operation is executed by using the ALU 201 and saturation circuit 209.
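
The shift, round, and limit sequence of the RACHI instruction can be sketched as follows. This is only an approximation under stated assumptions: the function name is hypothetical, the accumulator value is given as a signed Python integer, the "upper 16 bits" are taken to be bits 31 to 16 of the shifted value, and rounding by adding one half of the discarded range (h'8000) is an assumed rounding rule that the text does not spell out.

```python
def rachi(acc):
    """Return the 16-bit pattern written back for a signed accumulator value."""
    shifted = acc << 1                   # 1-bit left shift
    rounded = (shifted + 0x8000) >> 16   # assumed round-half-up to the upper 16 bits
    if rounded > 0x7FFF:
        return 0x7FFF                    # overflow: limit to the maximum value
    if rounded < -0x8000:
        return 0x8000                    # underflow: limit to the minimum value
    return rounded & 0xFFFF
```

For example, `rachi(0x4000)` returns 1, while `rachi(1 << 31)` is limited to h'7fff.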

The register R0 holds input data to the next stage. The register R1 is used as update data Di1 (i is an integer not more than n). The registers R2 to R5 are used as a buffer for holding the data. The registers R6, and R8 to R11 are used as a buffer for coefficients. The register R12 is used as a data pointer. The register R14 is used as a pointer for coefficients. The register R7 holds invalid data to maintain 32-bit alignment of the coefficient data.

FIG. 23 illustrates mapping of the internal data memory with respect to the coefficients and data. The respective data are 16 bits in length. Five coefficients are provided per stage and two data are provided per stage in the arrangement of FIG. 20. A coefficient Ai (621) is to be multiplied by an input value. The reference numerals 622 and 626 designate dummy areas for efficient access to the coefficients.

FIG. 24 (FIGS. 24A to 24C) illustrates a flow of processing within the loop wherein the pipeline stages are depicted as the abscissa with time as the ordinate. All instructions are stored in the internal instruction memory 103, and all operand data are stored in the internal data memory 105. One clock cycle is required to complete the processing at one stage. When the repeat processing continues, T1 follows T7. As is the case with the above described FIR filter, the fetched data are referred to at least 2 cycles later. The instruction at 611 directs that multiplication is executed at the same time as two words are stored for data updating. The MV2W instruction directs that the contents of data are transferred two words at a time (at 643) so that the same register number is used each time the loop is repeated. The value in the register R4 loaded at 659 during T4 period and the value in the register R10 loaded at 655 during T3 period are multiplied together at 666 during T6 period, and the value in the register R5 loaded at 659 during T4 period and the value in the register R11 loaded at 655 during T3 period are multiplied together at 670 during T7 period. Such processing requires 7 cycles to implement one stage of the second-order direct-form type-II IIR filter.

An example of IFFT (inverse fast Fourier transform) is described below. Unit processing is as follows:

tmp_r=(b_r * c_r)−(b_i * c_i);

tmp_i=(b_r * c_i)+(b_i * c_r);

A_r=a_r−tmp_r;

A_i=a_i−tmp_i;

B_r=a_r+tmp_r;

B_i=a_i+tmp_i;

where a and b are complex variables of input data, A and B are complex variables of output data (update data), tmp is a temporary complex variable, c is a complex constant, “_r” is a real part, and “_i” is an imaginary part.
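
The unit processing can be written out directly in Python as a check of the arithmetic. The function name `ifft_unit` and the representation of each complex value as a (real, imaginary) pair are choices made for this sketch only; the imaginary part of B is computed as a_i + tmp_i, by symmetry with A.

```python
def ifft_unit(a, b, c):
    """One unit processing on (real, imaginary) pairs; returns (A, B)."""
    a_r, a_i = a
    b_r, b_i = b
    c_r, c_i = c
    tmp_r = (b_r * c_r) - (b_i * c_i)  # real part of b * c
    tmp_i = (b_r * c_i) + (b_i * c_r)  # imaginary part of b * c
    A = (a_r - tmp_r, a_i - tmp_i)
    B = (a_r + tmp_r, a_i + tmp_i)
    return A, B
```

For a = 1 + 2i, b = 3 + 4i, and c = 5 + 6i, the product b * c is -9 + 38i, so the sketch returns A = (10, -36) and B = (-8, 40).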

FIG. 25 illustrates an example of a program in a loop part when two unit processings of the IFFT form one loop. The registers R0, R1, and R2, R3 hold a and A. The registers R4 and R5 hold b. The registers R6 and R7 hold tmp. The registers R8 and R9 hold a and B. The registers R10 and R11 hold c. The even-numbered registers hold the real part, and the odd-numbered registers hold the imaginary part. The register R12 holds the address of a. The register R14 holds the address of b.

The symbol “msu” indicates the multiply-subtract instruction which directs that the result of multiplication is subtracted from the accumulator. This is because the square of i (the imaginary unit) equals −1.

In this case, 15 cycles are required to implement two unit processings with respect to two pairs of complex numbers. The thirty sub-instructions include four 2-word load instructions, four 2-word store instructions, four additions, four subtractions, four multiplications, two multiply-add operations, two multiply-subtract operations, four rounding instructions, and two 2-word transfer instructions, providing very high efficiency of operations.

FIG. 26 illustrates an example of a program in a loop part of a subtract-absolute-add operation. An absadd instruction directs that the specified register value is applied from the register file 115 to the low-order positions of the shifter 205 through the S5 bus 305. The shifter 205 performs sign extension on the value into 40 bits but does not shift the value, and applies the value to the B latch 203. The ALU 201 performs the operation on this value and the value in the specified accumulator 208, and the result of the operation is written back to the accumulator 208. The ALU 201 performs addition when the value in the B latch 203 is positive, and performs subtraction when the value is negative. The subtraction is implemented by inverting the data and providing a carry to the least significant bit. In this manner, one clock cycle is required to implement the absolute-add operation.
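
The absolute-add rule (add a positive value, subtract a negative one) can be sketched in Python as follows. The function names are hypothetical, values are plain signed Python integers, and the sketch assumes that each difference fits in 16 bits; it models only the arithmetic, not the latches or buses.

```python
def absadd(acc, value):
    """Add the absolute value of a signed value to the accumulator."""
    if value >= 0:
        return acc + value       # positive: plain addition
    # Negative: subtract, implemented as inversion plus a carry into the LSB.
    return acc + (~value) + 1


def subtract_absolute_add(xs, ys):
    """Accumulate |x - y| over two equal-length sequences."""
    acc = 0
    for x, y in zip(xs, ys):
        acc = absadd(acc, x - y)  # subtraction first, then absolute-add
    return acc
```

For example, `subtract_absolute_add([1, 5, 3], [4, 2, 3])` returns 6.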

FIG. 27 (FIGS. 27A to 27C) illustrates a flow of processing in the loop. The 2-word load instruction is executed in parallel with the absolute-add instruction of the result of subtraction. During T3 cycle, data are loaded to the registers R6 and R7 in response to the LD2W instruction of I5a (813). Then, during T4 cycle, data are loaded to the registers R2 and R3 in response to the LD2W instruction of I6a (817). During T6 cycle, the value in the register R7 is subtracted from the value in the register R3 in response to the instruction indicated by I4a, and the result of subtraction is written back to the register R3. Further, the value in the register R6 is subtracted from the value in the register R2 in parallel in response to the instruction indicated by I4b, and the result of subtraction is written back to the register R2 (824). In this manner, four subtract-absolute-add operations are implemented in 6 cycles.

Second Preferred Embodiment

FIG. 28 is a block diagram of a second operation unit 120 for the data processor according to a second preferred embodiment of the present invention corresponding to the second operation unit 117 of the first preferred embodiment. Other units of the data processor of the second preferred embodiment are similar in construction to those of the first preferred embodiment. The second operation unit 120 differs from the second operation unit 117 of the first preferred embodiment in that it includes an ALU 221 operable independently of an adder 231 for performing the multiply-add operation. This allows the execution of the addition and subtraction of the multiply-add/multiply-subtract instructions and other arithmetic and logic operations without interference of hardware.

The ALU 221 performs a 16-bit arithmetic and logic operation. An A2 latch 222 connected to the S4 bus 304 and a B2 latch 223 connected to the S5 bus 305 are input latches for the ALU 221. An ALUO latch 225 is an output latch for the ALU 221. A selector 224 selects the output from the ALU 221, the value in the A2 latch 222, or the value in the B2 latch 223 to write back the selected value to the register file 115 through the D2 bus 312. The output from the ALUO latch 225 may be set to the low-order positions and subjected to sign extension by a shifter 235. A B latch 233 selectively receives the output from the shifter 235 or the value in the P register 214 serving as the output latch of the multiplier 211. An A latch 232 receives data from the accumulator 208 through the shifter 204. Other elements of the second operation unit 120 are substantially identical with those of the second operation unit 117 of the first preferred embodiment.

An example of processing is described below. FIG. 29 illustrates an exemplary program of a subtract-square-add operation. Initialization is performed at 701, the loop processing at 702, and the post-processing at 703. FIG. 30 illustrates the contents of the internal instruction memory corresponding to the loop part. The subtract-square-add operation of D1[i] and D2[i] is performed. The register R12 holds the address of D1[i], and the register R14 holds the address of D2[i]. The registers R0 to R3 hold the data D1[i], and the registers R4 to R7 hold the data D2[i]. All instructions are executed in parallel. Six cycles are required to execute the processing four times. FIG. 31 illustrates mapping of the internal data memory with respect to the data. The respective data are 16 bits in length. The data D1[i] and D2[i] are stored in different areas.

FIG. 32 (FIGS. 32A to 32C) illustrates a flow of processing in the loop. For example, the adder 231 and the ALU 221 execute the addition (746) of the multiply-add operation and the subtraction (745) for determining the difference in parallel during T6 period.

The data processor of the second preferred embodiment may more efficiently process the subtract-absolute-add operation described in the first preferred embodiment. FIG. 33 illustrates the contents of the internal instruction memory corresponding to the loop part. The register R12 holds the address of a first data array, and the registers R0 to R5 hold the data thereof. The register R14 holds the address of a second data array, and the registers R6 to R11 hold the data thereof. A daadd instruction is an instruction for performing the subtract-absolute-add operation, and the result of the operation is held in the accumulator. This instruction, like the multiply-add operation instruction, directs that two-stage pipeline processing is executed.

FIG. 34 (FIGS. 34A to 34C) illustrates a flow of processing in the loop. The processing conditions in FIG. 34 are substantially similar to those in the case of the FIR filter described in the first preferred embodiment. The subtract-absolute-add operation is executed in FIG. 34 in place of the multiply-add operation. That is, multiplication is replaced with subtraction, and the addition is replaced with the absolute-add operation. In this manner, the throughput of the processing is such that one subtract-absolute-add operation is executed per cycle.

Third Preferred Embodiment

FIG. 35 is a functional block diagram of the data processor according to a third preferred embodiment of the present invention. An MPU 850 is an MPU core. An instruction fetch unit 863 and an operand access unit 864 are substantially similar to the instruction fetch unit 102 and operand access unit 104 of the data processor of the first preferred embodiment. The instruction data which are 64 bits in length are applied to the instruction fetch unit 863. The bus interface unit and the like are not shown in FIG. 35.

The MPU core 850 comprises an instruction queue 851, a control unit 852, a register file 860, a first operation unit 858, a second operation unit 859, a third operation unit 861, and a fourth operation unit 862. The instruction queue 851 is a FIFO-controlled instruction buffer for holding a maximum of two 64-bit instructions. The first operation unit 858 includes an incrementor, a decrementor, and an adder and performs management of the PC value, calculation of the branch target address, and repeat control. The second operation unit 859 includes an ALU and an alignment circuit and performs operand address generation, updating of the pointer, arithmetic and logic operations, transfer, comparison, holding and alignment of loaded data, and holding and alignment of data to be stored. The third operation unit 861 includes an ALU and a shifter, and performs operation processing such as arithmetic and logic operations, transfer, comparison and shift. The fourth operation unit 862 includes a multiply-add operation unit, a shifter, and an accumulator, and mainly performs the multiply-add and multiply-subtract operations and accumulator shift. In this manner, the MPU core 850 has the four independent arithmetic operation units connected to the register file.

The control unit 852 includes an instruction decode unit 853. The instruction decode unit 853 has four decoders. FIG. 36 illustrates an instruction format processed by the data processor of the third preferred embodiment. FM bits 871 having a 4-bit format are divided into two 2-bit fields for specifying the formats of the combination of first and second containers 872 and 873 and the combination of third and fourth containers 874 and 875 in the same manner as in the data processor of the first preferred embodiment. Each of the first to fourth containers 872 to 875 is expressed in 15 bits.
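
The field widths above add up to 64 bits (4 FM bits plus four 15-bit containers). The following Python sketch packs and unpacks such a word. The bit positions are an assumption made for illustration (FM bits in the most significant positions, followed by the first to fourth containers in order), as are the function names; FIG. 36 is not reproduced here.

```python
CONTAINER_BITS = 15
CONTAINER_MASK = (1 << CONTAINER_BITS) - 1


def pack_instruction(fm, containers):
    """Pack 4 FM bits and four 15-bit containers into one 64-bit word."""
    if not 0 <= fm < 16 or len(containers) != 4:
        raise ValueError("expected 4 FM bits and four containers")
    word = fm
    for c in containers:
        if not 0 <= c <= CONTAINER_MASK:
            raise ValueError("each container must fit in 15 bits")
        word = (word << CONTAINER_BITS) | c
    return word


def unpack_instruction(word):
    """Return ((fm_first_pair, fm_second_pair), [container1..container4])."""
    containers = [(word >> shift) & CONTAINER_MASK for shift in (45, 30, 15, 0)]
    fm = (word >> 60) & 0xF
    return (fm >> 2, fm & 0b11), containers
```

Splitting the FM bits into two 2-bit fields mirrors the description that each field governs one pair of containers.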

The first decoder 854 mainly decodes the operation code of the first container 872 to produce control signals for the register file 860 and first operation unit 858. The branch instruction is mainly specified in the field of the first container 872. The second decoder 855 mainly decodes the operation code of the second container 873 to produce control signals for the register file 860 and second operation unit 859. The load/store instruction, arithmetic and logic operation instruction, transfer instruction, and comparison instruction are mainly specified in the field of the second container 873. The third decoder 856 mainly decodes the operation code of the third container 874 to produce control signals for the register file 860 and third operation unit 861. The arithmetic and logic operation instruction, transfer instruction, comparison instruction, and shift instruction are mainly specified in the field of the third container 874. The fourth decoder 857 mainly decodes the operation code of the fourth container 875 to produce control signals for the register file 860 and fourth operation unit 862. The multiply-add operation instruction is mainly specified in the field of the fourth container 875.

Such an arrangement allows highly sophisticated parallel processing. The present invention is applicable also in such a case. For example, to execute the subtract-square-add operation, the second, third, and fourth operation units 859, 861, 862 may execute the load, subtraction, and absolute-add operation in parallel, respectively, to provide the throughput of processing such that one subtract-square-add operation is executed per cycle. Similarly, the IFFT processing is also executed at high speeds. According to the present invention, the number of arithmetic units is not limited but may be determined on the basis of a trade-off between required performance and costs. If the throughput of data transfer is insufficient, four words should be transferred in parallel. If the number of registers is insufficient, the number of registers should be increased, which requires an increase in bit length.

For improvement in the throughput of the multiply-add operation, a plurality of operation units should have multiply-add operation units to increase the number of operands to be transferred. The present invention is also applicable in this case.

Fourth Preferred Embodiment

FIG. 37 is a block diagram of the second operation unit 120 of the data processor according to a fourth preferred embodiment of the present invention corresponding to the second operation unit 117 of the data processor of the first preferred embodiment. Other units of the data processor of the fourth preferred embodiment are substantially identical in construction with those of the first preferred embodiment.

In the fourth preferred embodiment, the result of the multiply-add operation is stored in the register for each operation, and no guard bits are provided. Additional guard bits may be provided if required in terms of operation accuracy.

FIG. 38 illustrates an instruction format for the data processor of the fourth preferred embodiment wherein the instruction is 64 bits in length. The instruction comprises 2 FM bits 941, a 31-bit left-hand container 942, and a 31-bit right-hand container 943. FIG. 39 illustrates a basic format of each container which is basically a 3-operand format including three fields: two source register number specification fields 946 and 947, and one destination register number specification field 945. There are provided 64 registers, and the register number is specified in a 6-bit field.

The basic pipeline processing of the fourth preferred embodiment is similar to that of the first preferred embodiment, and the description thereof is omitted. An S4 bus 913, an S5 bus 914, a D2 bus 915, and a D3 bus 916 operate at the E stage 403. These buses mainly transfer operand values of an ordinary integer operation instruction and the like. An S6 bus 911, an S7 bus 912, a D4 bus 917, and a D5 bus 918 operate at the E2 stage 406. These buses mainly transfer accumulated values. An adder 934 and its input and output units operate at the E stage 403 and the E2 stage 406, but other arithmetic units and latches operate at the E stage 403.

A “mac” instruction for multiply-add operation is a 3-operand instruction. The processing specifications of “mac Rdest, Rsrc1, Rsrc2” are such that the register value specified by Rsrc1 and the register value specified by Rsrc2 are multiplied together and the result of multiplication is added to the value in the pair of registers specified by Rdest. The hardware processing is described in detail. First, at the E stage 403, the register values specified by Rsrc1 and Rsrc2 are transferred from a register file 903 through the S4 bus 913 and S5 bus 914 to an X latch 938 and a Y latch 939, respectively. A multiplier 940 performs multiplication, and the P register 941 holds the result of multiplication. Then, at the E2 stage 406, the values in the pair of registers specified by Rdest are applied to an A latch 931 from the register file 903 through the S6 bus 911, the S7 bus 912, and the shifter 930. The value in the P register 941 is applied to a B latch 933. The adder 934 adds the values in the A latch 931 and B latch 933 together. The result of addition is outputted through a saturation circuit 937 to the D4 bus 917 and D5 bus 918, and written back to the pair of registers specified by Rdest in the register file 903.
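The two-stage behavior of the “mac” instruction can be sketched as a small software model. Several details are assumptions made for illustration and are not stated in this passage: words are 16 bits and signed (two's complement), the register pair is Rdest and Rdest+1 with the upper word in Rdest, the multiplication is an integer (not fractional) multiplication, the shifter 930 passes the value through unshifted, and the saturation circuit clamps to the signed 32-bit range. All function names are hypothetical.

```python
WORD_BITS = 16            # assumed word length
ACC_BITS = 2 * WORD_BITS  # a register pair holds the accumulated value
WORD_MASK = (1 << WORD_BITS) - 1


def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def saturate(value, bits):
    """Clamp value to the signed range representable in `bits` bits."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))


def mac(regs, rdest, rsrc1, rsrc2):
    """Model of 'mac Rdest, Rsrc1, Rsrc2' on a list of 16-bit registers."""
    # E stage: source values go to the X and Y latches; the product is
    # held in the P register.
    x = to_signed(regs[rsrc1], WORD_BITS)
    y = to_signed(regs[rsrc2], WORD_BITS)
    p = x * y
    # E2 stage: the register pair is read as the accumulated value (A latch),
    # added to P (B latch), saturated, and written back to the same pair.
    acc = to_signed((regs[rdest] << WORD_BITS) | regs[rdest + 1], ACC_BITS)
    result = saturate(acc + p, ACC_BITS) & ((1 << ACC_BITS) - 1)
    regs[rdest] = result >> WORD_BITS      # assumed: upper word in Rdest
    regs[rdest + 1] = result & WORD_MASK   # assumed: lower word in Rdest+1
```

For example, with 3 in one source register and 0xFFFE (that is, -2) in the other and a cleared destination pair, the pair holds 0xFFFF and 0xFFFA afterwards, which is -6 as a 32-bit value.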

FIG. 40 illustrates an example of the program in a loop part of the FIR filter. The 2-word load instruction and the multiply-add operation instruction are decoded in parallel and executed in parallel. This achieves one multiply-add operation per cycle in the same manner as in the data processor of the first preferred embodiment except that the result of the multiply-add operation is written back to the register file for each operation.
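The way a 2-word load paired with a multiply-add sustains one multiply-add per cycle can be illustrated with a cycle-counting sketch. This is not the program of FIG. 40, which is not reproduced here. It assumes a schedule in which each cycle issues one 2-word load and one multiply-add, with coefficient pairs and data pairs loaded in alternate cycles into registers other than those currently being consumed. The function name and the unbounded Python integer accumulator (no saturation) are likewise assumptions.

```python
def fir_tap_loop(coeffs, samples):
    """Return (sum of products, cycle count) under the assumed schedule."""
    if len(coeffs) != len(samples) or len(coeffs) % 2 != 0:
        raise ValueError("need equal, even-length inputs")
    n = len(coeffs)
    acc = 0
    cycles = 0
    if n == 0:
        return acc, cycles

    # Prologue: two 2-word loads fill the first coefficient and data pairs.
    c_pair = (coeffs[0], coeffs[1])
    x_pair = (samples[0], samples[1])
    cycles += 2

    for k in range(0, n, 2):
        nxt = k + 2
        # Cycle A: mac on the first words, in parallel with a 2-word load
        # of the next coefficient pair.
        acc += c_pair[0] * x_pair[0]
        next_c = (coeffs[nxt], coeffs[nxt + 1]) if nxt < n else None
        cycles += 1
        # Cycle B: mac on the second words, in parallel with a 2-word load
        # of the next data pair.
        acc += c_pair[1] * x_pair[1]
        next_x = (samples[nxt], samples[nxt + 1]) if nxt < n else None
        cycles += 1
        c_pair, x_pair = next_c, next_x
    return acc, cycles
```

With four taps, `fir_tap_loop([1, 2, 3, 4], [5, 6, 7, 8])` returns `(70, 6)`: two prologue cycles followed by four cycles that each complete one multiply-add. Loading one word per cycle instead would need two load cycles per multiply-add, since each multiply-add consumes one coefficient and one data word.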

In this manner, the technique of the present invention is also effective when the cumulative result of the multiply-add operations is held in the register. The operands are not bypassed herein, but a bypass path may be provided as required.

Fifth Preferred Embodiment

In the data processor of the above described preferred embodiments, one word is assumed to be 16 bits in length. However, one word may be any number of bits in length. For example, audio processing requires a precision of about 24 bits, so one word may be 24 bits in length. One word may be 32 bits long in terms of alignment with a processor. In this case, the multiplier need not necessarily be of 1-word by 1-word form; a multiplier of the size required for the accuracy of the application to be processed should be selected and implemented.

Sixth Preferred Embodiment

In the data processor of the above described preferred embodiments, the multiply-add operation is subjected to two-stage pipeline processing. However, the last addition stage of the multiplier and the adder for addition may be merged to execute the multiply-add operation in one cycle. Additionally, for high-speed operation, the multiplication may be performed by a 2-stage pipeline to permit the multiply-add operation to be performed by a 3-stage pipeline. Other pipeline configurations may be freely selected. For instance, the E stage 403 and M stage 404 may be merged into one pipeline stage for processing. Further, to improve the operating frequency, the write back operation to the register may be performed in a pipeline stage different from the E stage. In addition, data may be bypassed from the write path to the register, which is effective for high-speed processing.

Seventh Preferred Embodiment

In the above described preferred embodiments of the present invention, the microprocessor of VLIW architecture is illustrated as an example. However, the technique of the present invention is also applicable to a superscalar RISC processor and the like, provided that the program is written with the details of the hardware in mind and the conditions for parallel processing are satisfied. The difference is whether the parallel execution is encoded in the program or the hardware determines whether or not the parallel execution is permitted. The FM bits may be absent in the VLIW architecture, in which case the instructions are always executed in parallel and any operation which is not to be executed in parallel should be set to NOP.

While the invention has been described in detail, the foregoing description is in all aspects illustrative and not restrictive. It is understood that numerous other modifications and variations can be devised without departing from the scope of the invention.

We claim:
 1. A data processor comprising: a first memory for storing an instruction including a first operation code and a second operation code; a second memory for storing data values; a first decoder receiving said first operation code from said first memory for decoding said first operation code; a second decoder receiving said second operation code from said first memory for decoding said second operation code, said first and second operation codes being decoded in parallel; a register file including a plurality of registers for storing data values to be transferred from and to said second memory; an operation unit for receiving a first data value stored in a first register of said register file to perform an arithmetic operation using said first data value in response to a first control signal, said first control signal being a decoded result of said first operation code output from said first decoder; and an operand access unit for performing a memory access to transfer in parallel second and third data values stored in said second memory to second and third registers of said register file, respectively, in response to a second control signal, said second control signal being a decoded result of said second operation code output from said second decoder, said memory access of said operand access unit and said arithmetic operation of said operation unit are performed in parallel.
 2. The data processor of claim 1, wherein said second and third data values each are n bits in length, where n is an integral number, and said second and third data values are combined together into a 2n-bits data value when said second and third data values are transferred to said register file.
 3. The data processor of claim 1, wherein said operation unit comprises: a multiplier for multiplying together a fourth data value stored in a fourth register of said register file and said first data value; and an adder for adding together at least a result from said multiplier and a value stored in said register file to cause said register file to store a result from said adder.
 4. The data processor of claim 1, wherein said operation unit comprises: a multiplier for multiplying together a fourth data value stored in a fourth register of said register file and said first data value, an accumulator for holding an accumulated data value which is a result of an operation, and an adder for adding together at least a result from said multiplier and the accumulated data value held in said accumulator to cause said accumulator to hold a result from said adder.
 5. The data processor of claim 1, wherein said operation unit receives a fourth data value stored in a fourth register of said register file, said arithmetic operation in said operation unit including a multiplying operation of said first and fourth data values and an adding operation at least on a result of said multiplying operation and a fifth data stored in said register file or an accumulator provided with said operation unit; said operand access unit causes an operand address to be applied to said second memory simultaneously with the multiplying operation of said operation unit in a first period, and causes said second and third data values to be transferred from said second memory in response to the operand address simultaneously with the adding operation of said operation unit in a second period following said first period.
 6. The data processor of claim 5, wherein said operand address is stored in a fifth register of said register file and said operand access unit causes said fifth register to be updated into another operand address in said first period.
 7. A data processor comprising: a first memory for storing an instruction including a first operation code and a second operation code; a second memory for storing data values; a first decoder receiving said first operation code from said first memory, for decoding said first operation code; a second decoder receiving said second operation code from said first memory, for decoding said second operation code, said first and second operation codes being decoded in parallel; a register file including a plurality of registers for storing data values to be transferred from and to said second memory; an operation unit for receiving a first data value stored in a first register of said register file to perform an arithmetic operation using said first data value in response to a first control signal, said first control signal being a decoded result of said first operation code output from said first decoder; and an operand access unit for performing a memory access to transfer in parallel second and third data values stored respectively in second and third registers of said register file to said second memory in response to a second control signal, said second control signal being a decoded result of said second operation code output from said second decoder, said memory access of said operand access unit and said arithmetic operation of said operation unit are performed in parallel.
 8. The data processor of claim 7, wherein said second and third data values each are n bits in length, where n is an integral number, and said second and third data values are combined together into a 2n-bits data value when said second and third data values are transferred to said second memory.
 9. The data processor of claim 7, wherein said operation unit comprises: a multiplier for multiplying together a fourth data value stored in a fourth register of said register file and said first data value, and an adder for adding together at least a result from said multiplier and a value stored in said register file to cause said register file to store a result from said adder.
 10. The data processor of claim 7, wherein said operation unit includes a multiplier for multiplying together a fourth data value stored in a fourth register of said register file and said first data value, an accumulator for holding an accumulated data value which is a result of an operation, and an adder for adding together at least a result from said multiplier and the accumulated data value held in said accumulator to cause said accumulator to hold a result from said adder.
 11. A data processor comprising: a memory for storing data; a first instruction decoder receiving first and second operation codes, for decoding said first and second operation codes to output first and second control signals respectively; a second instruction decoder receiving third and fourth operation codes, for decoding said third and fourth operation codes to output third and fourth control signals, respectively, said first and third operation codes being decoded in parallel and said second and fourth operation codes being decoded in parallel; a register file connected to said memory and including a plurality of registers each for storing at least one of data and an operand address; an operation unit for performing an arithmetic operation on the data stored in said register file; and a memory access device operated in parallel with said operation unit for causing said operand address stored in said register file to be applied to said memory and for updating said operand address, wherein, in a first processing, first and second decoders receive said first and third operation codes respectively, and executed in parallel are processing of: (a) said operation unit to receive first data stored in a first register of said register file to perform an arithmetic operation in response to said first control signal, and (b) said memory access device to cause a first operand address stored in a second register of said register file to be applied to said memory to cause second data stored in said memory to be transferred to a third register of said register file and to update said first operand address to write a second operand address into said second register in response to said third control signal, and wherein, in a second processing, said first and second decoders receive said second and fourth operation codes respectively, and executed in parallel are processing of: (c) said operation unit to receive said second data stored in said third register of said register file to perform an arithmetic operation in response to said second control signal, and (d) said memory access device to cause said second operand address stored in said second register of said register file to be applied to said memory to cause third data stored in said memory to be transferred to a fourth register of said register file and to update said second operand address to write a third operand address into said second register in response to said fourth control signal, said first processing and said second processing being executed by pipeline control.
 12. A method of processing data by a data processor which includes a memory for storing data, a register file connected to said memory and including a plurality of registers each for storing at least one of data and an operand address, an operation unit for receiving the data stored in said register file to perform an arithmetic operation, and a memory access device for causing the operand address stored in said register file to be applied to said memory, said method comprising the steps of: (a) transferring, in parallel, first and second data stored in a first area of said memory to first and second registers of said register file, respectively; (b) transferring, in parallel, third and fourth data stored in a second area of said memory to third and fourth registers of said register file, respectively; (c) applying said first data stored in said first register and said third data stored in said third register to said operation unit to perform an arithmetic operation of said first and third data by said operation unit; and (d) applying said second data stored in said second register and said fourth data stored in said fourth register to said operation unit to perform an arithmetic operation of said second and fourth data by said operation unit.
 13. The method of claim 12, further comprising the steps of: (e) transferring, in parallel, fifth and sixth data stored in a third area of said memory to fifth and sixth registers of said register file, respectively; and (f) transferring, in parallel, seventh and eighth data stored in a fourth area of said memory to seventh and eighth registers of said register file, respectively, wherein one of the steps (c) and (d) is executed in parallel with at least one of the steps (e) and (f).
 14. The method of claim 13, wherein said third area is the same as said first area, and said fourth area is the same as said second area.
 15. The method of claim 12, wherein said first and second data each are n bits in length, where n is an integral number, and wherein said first and second data are combined together into 2n-bit data when said first and second data are transferred to said register file.
 16. The method of claim 12, wherein the step (c) comprises the sub-steps of: multiplying said first and third data together; and adding data stored in a ninth register to the result of multiplication to store the result of addition as ninth data in said ninth register, and wherein the step (d) comprises the sub-steps of: multiplying said second and fourth data together; and adding said ninth data stored in said ninth register to the result of multiplication to store the result of addition in said ninth register.
 17. A data processor comprising: an instruction decoder for decoding first and second operation codes to output first and second control signals, respectively; a register file including a plurality of registers; a memory for storing data; an operand access unit for performing a first memory access to transfer first and second data in parallel from a first area of said memory to said register file in response to said first control signal, and performing a second memory access to transfer third and fourth data in parallel from a second area of said memory to said register file in response to said second control signal; and an operation unit receiving said first to fourth data from said register file, for performing a first arithmetic operation on said first and third data and a second arithmetic operation on said second and fourth data.
 18. The data processor of claim 17, wherein said first arithmetic operation is a multiplying operation on said first and third data and said second arithmetic operation is a multiplying operation on said second and fourth data, and said operation unit generates a value which is a result of adding a result of said first arithmetic operation, a result of said second arithmetic operation and fifth data in said register file or an accumulator provided with said operation unit.
 19. The data processor of claim 17, wherein said instruction decoder decodes third and fourth operation codes to output third and fourth control signals, respectively; said operation unit performs said first arithmetic operation in response to said third control signal and performs said second arithmetic operation in response to said fourth control signal.
 20. The data processor of claim 17, wherein said first to fourth data have the same data lengths.
 21. The data processor of claim 17, wherein a first operand address is stored in a register of said register file, said operand access unit causes said first operand address to be applied to said memory and said register to be updated into a second operand address in response to said first control signal, and causes said second operand address stored in said register to be applied to said memory in response to said second control signal.
 22. The data processor of claim 17, wherein: a first operand address is stored in a first register of said register file, and a second operand address is stored in a second register of said register file, and said operand access unit causes said first operand address to be applied to said memory and said first register to be updated into a third operand address in response to said first control signal, and causes said second operand address to be applied to said memory and said second register to be updated into a fourth operand address in response to said second control signal.
 23. A data processor comprising: a first decoder for decoding operation codes including first and second operation codes; an operand access unit for outputting a first address to receive in parallel first and second data values included in a first area of a memory in accordance with decoding the first operation code by said first decoder, and for outputting a second address to receive in parallel third and fourth data values included in a second area of the memory in accordance with decoding the second operation code by said first decoder; and an operation unit coupled to said operand access unit and receiving said first to fourth data values, for calculating a first product of said first and third data values, and a second product of said second and fourth data values.
 24. The data processor of claim 23, further comprising: a first register for storing the first address and outputting the first address to said operand access unit; a second register for storing the second address and outputting the second address to said operand access unit; and an address calculator, for calculating a third address on the basis of the first address in a first period in accordance with decoding the first operation code by said first decoder to write back the third address to said first register, and calculating a fourth address on the basis of the second address in a second period following the first period in accordance with decoding the second operation code by said first decoder to write back the fourth address to said second register.
 25. The data processor of claim 24, wherein: said address calculator calculates the third address by adding the first address with a predetermined value and calculates the fourth address by adding the second address with the same value as the predetermined value.
 26. The data processor of claim 23, wherein: said operation unit includes an accumulator, said operation unit calculating a sum of a value stored in said accumulator, the first product and the second product.
 27. The data processor of claim 23, further comprising: a second decoder operative in parallel with said first decoder, for decoding operation codes, wherein an operation unit calculates the first product in accordance with decoding a third operation code by said second decoder, and the second product of said second and fourth data values in accordance with decoding a fourth operation code by said second decoder.
 28. A method of processing data by a data processor connected to a memory and executing instructions described in a program, said method comprising the steps of: transferring in parallel first and second data values included in a first area of the memory to said data processor; transferring in parallel third and fourth data values included in a second area of the memory to said data processor; calculating a product of the first and third data values in the data processor; and calculating a product of the second and fourth data values in the data processor.
 29. A data processor comprising: a first decoder configured to receive and decode a first operation code specifying a data load operation; a second decoder configured to receive and decode a second operation code specifying a multiply-add operation; a first operation unit configured to provide a memory with an operand address of said first operation code, at least one said operand address configured to cause plural operand data to be loaded in parallel from the memory in response to a first control signal being a decoded result of said first operation code output from said first decoder; a plurality of registers configured to receive and store data values included in operand data loaded from the memory; and a second operation unit configured to receive data values from said registers, and to perform the multiply-add operation in response to a second control signal being a decoded result of said second operation code output from said second decoder; wherein said multiply-add operation includes a multiplying operation of one by another of the data values received by the second operation unit, and an adding operation utilizing a result of said multiplying operation.
 30. A data processor comprising: a first decoder configured to receive and decode a first operation code specifying a data load operation; a second decoder configured to receive and decode a second operation code specifying a multiply-add operation; a plurality of registers including a register configured to store an operand address of said first operation code and to output the operand address to a memory to cause, for at least one said operand address, plural operand data to be loaded in parallel from the memory, said plurality of registers including different two registers configured to receive and store first and second data values, respectively, included in the plural operand data loaded in parallel from the memory; an arithmetic unit configured to generate a new operand address using the operand address and to update contents of the register storing the operand address into the new operand address in response to a first control signal being a decoded result of the first operation code output from said first decoder; and an operation unit configured to receive third and fourth data values stored in different two of said plurality of registers, respectively, and to perform the multiply-add operation in response to a second control signal being a decoded result of said second operation code output from said second decoder; wherein said multiply-add operation includes a multiplying operation of the third data value by the fourth data value and an adding operation utilizing a result of said multiplying operation.
 31. A data processor comprising: a first decoder configured to receive and decode a first operation code specifying a data load operation; a second decoder configured to receive and decode a second operation code specifying a multiply-add operation; a first operation unit configured to provide a memory with an operand address of said first operation code to cause, for at least one said operand address, plural operand data to be loaded in parallel from the memory in response to a first control signal being a decoded result of said first operation code output from said first decoder; a register file configured to store first and second data values which are included in the plural operand data loaded in parallel from the memory; and a second operation unit configured to receive third and fourth data values stored in said register file and to perform the multiply-add operation in response to a second control signal being a decoded result of said second operation code output from said second decoder; wherein said multiply-add operation includes a multiplying operation of the third data value by the fourth data value and an adding operation utilizing a result of said multiplying operation; and wherein said second operation code has an operand specifying field capable of specifying one of the first and second data values as the third data value, and a data value different from said first and second data values as the fourth data value.
 32. A data processor comprising: a first decoder configured to receive and decode a first operation code specifying a data load operation; a second decoder configured to receive and decode a second operation code specifying a multiply-add operation; a register file configured to store an operand address of said first operation code and to output the operand address to a memory to cause, for at least one said operand address, plural operand data to be loaded in parallel from the memory, said register file further configured to store first and second data values which are included in the operand data loaded in parallel from the memory; an arithmetic unit configured to generate a new operand address using the operand address and to update the operand address stored in said register file into the new operand address in response to a first control signal being a decoded result of the first operation code output from said first decoder; and an operation unit configured to receive third and fourth data values stored in said register file and to perform the multiply-add operation in response to a second control signal being a decoded result of said second operation code output from said second decoder; wherein said multiply-add operation includes a multiplying operation of the third data value by the fourth data value and an adding operation utilizing a result of said multiplying operation; and wherein said second operation code has an operand specifying field capable of specifying one of the first and second data values as the third data value, and a data value different from said first and second data values as the fourth data value.
 33. The data processor according to claim 29, wherein: said first decoder sequentially decodes operation codes including said first operation code and specifying operations to be executed; and said second decoder, operative in parallel with said first decoder, sequentially decodes operation codes including said second operation code and specifying operations to be executed.
 34. The data processor according to claim 29, wherein: said plurality of registers have a same bit length; and said second operation code is capable of specifying each of said plurality of registers as an operand of said second operation code.
 35. The data processor according to claim 29, wherein: said data load operation is a plural-operand data load operation; and each operand address of said first operation code is configured to cause plural operand data to be loaded in parallel from the memory.
 36. The data processor according to claim 29, wherein: data values included in the plural operand data loaded from the memory are stored in different respective registers in the plurality of registers.
 37. The data processor according to claim 30, wherein: said data load operation is a plural-operand data load operation; and each operand address of said first operation code is configured to cause plural operand data to be loaded in parallel from the memory.
 38. The data processor according to claim 31, wherein: said first decoder sequentially decodes operation codes including said first operation code and specifying operations to be executed; and said second decoder, operative in parallel with said first decoder, sequentially decodes operation codes including said second operation code and specifying operations to be executed.
 39. The data processor according to claim 31, wherein: said data load operation is a plural-operand data load operation; and each operand address of said first operation code is configured to cause plural operand data to be loaded in parallel from the memory.
 40. The data processor according to claim 31, wherein: data values included in the plural operand data loaded from the memory are stored in different respective registers in the register file.
 41. The data processor according to claim 32, wherein: said data load operation is a plural-operand data load operation; and each operand address of said first operation code is configured to cause plural operand data to be loaded in parallel from the memory.
 42. The data processor according to claim 32, wherein: data values included in the plural operand data loaded from the memory are stored in different respective registers in the register file.
 43. A data processor for loading values A1 to Am and values D1 to Dm, and for performing a calculation: $\sum\limits_{i = 1}^{m} A_i \cdot D_i$

in accordance with a program having a plurality of operation codes, wherein i and m are positive integers, said data processor comprising: a first decoder configured to decode a first operation code in the program; a second decoder configured to decode a second operation code in the program in parallel with the decoding of the first operation code by said first decoder; a first operation unit configured to output an operand address to a memory in response to a decoded result of the first operation code, the first operation unit configured to load values Ah and Aj in parallel from the memory for at least one said operand address, wherein h and j are different integers among 1 to m; and a second operation unit configured to perform a multiply-add operation in response to a decoded result of the second operation code output from said second decoder, the multiply-add operation including a multiply operation of a value Ak by a value Dk, wherein k is an integer among 1 to m other than h and j, and an adding operation of a result of the multiply operation to a value held in a register or in an accumulator.
 44. The data processor according to claim 43, wherein: said first operation code specifies a plural-operand data load operation; and each operand address of said first operation code is configured to cause plural operand data to be loaded in parallel from the memory.
 45. A data processor for loading values A1 to Am and values D1 to Dm, and for performing a calculation: $\sum\limits_{i = 1}^{m} A_i \cdot D_i$

in accordance with a program having a plurality of operation codes,wherein i and m are positive integers, said data processor comprising: afirst arithmetic operation circuit configured to multiply a value Aj bya value Dj and to multiply a value Ak by a value Dk, wherein j and k areintegers; a second arithmetic operation circuit configured to add avalue Pk=Ak·Dk to a value held in a register or in an accumulator; andan operand access unit configured to output an operand address to amemory, and to load a value Ah and a value An in parallel in accordancewith at least one said operand address, wherein h and n are integers;and a control unit configured to control said first and secondarithmetic operation circuits and said operand access unit in accordancewith the program so that the multiplication of Aj·Dj, the addition ofthe value Pk to the value held in the register or in the accumulator,and the loading of the values Ah and An, are performed in parallel. 46.The data processor according to claim 45, wherein: each operand addressis configured to cause plural operand data to be loaded in parallel froma memory.
 47. A data processor for executing a program, the data processor comprising: storage devices configured to store data values; a first arithmetic operation unit configured to perform a multiplying operation utilizing data values stored in said storage devices; a second arithmetic operation unit configured to perform an adding operation utilizing data values stored in said storage devices; an operand access unit configured to output an operand address to a memory and to load plural operand data in parallel for at least one said operand address; and a control unit configured to control said first and second arithmetic operation units and said operand access unit, depending on said program, so that a result of the multiplying operation, a result of the adding operation unit, and the loaded plural operand data, are stored in parallel in said storage devices.
 48. The data processor according to claim 47, wherein: data values included in the plural operand data loaded by the operand access unit are stored in different respective registers in said storage devices.
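For illustration only, and forming no part of the claims, the overlap of loading, multiplying and adding recited above may be modelled in software roughly as follows. This is a minimal sketch under stated assumptions: the function names `pack`, `unpack` and `pipelined_mac` are hypothetical, each memory word is assumed to hold one unsigned n-bit value Ai in its upper half and one unsigned n-bit value Di in its lower half, and each loop iteration stands for one step in which the three operations act only on results of earlier steps.

```python
def pack(a, d, n=16):
    """Combine two unsigned n-bit values into one 2n-bit word (a in the upper half)."""
    mask = (1 << n) - 1
    return ((a & mask) << n) | (d & mask)


def unpack(word, n=16):
    """Split a 2n-bit word back into its two unsigned n-bit halves."""
    mask = (1 << n) - 1
    return (word >> n) & mask, word & mask


def pipelined_mac(memory, n=16):
    """Accumulate sum(Ai * Di) over packed words using a three-stage overlap.

    In every step the add uses the product formed one step earlier, the
    multiply uses the pair loaded one step earlier, and one new packed word
    is loaded. All three read only the state from the previous step, so they
    are independent of each other within a step (assumed model).
    """
    acc = 0          # accumulator
    product = None   # result of the multiply stage from the previous step
    loaded = None    # (Ai, Di) register pair filled by the previous load

    # Two extra steps drain the multiply and add stages after the last load.
    for step in range(len(memory) + 2):
        next_acc = acc + product if product is not None else acc
        next_product = loaded[0] * loaded[1] if loaded is not None else None
        next_loaded = unpack(memory[step], n) if step < len(memory) else None
        # Commit all three results together, as if stored in parallel.
        acc, product, loaded = next_acc, next_product, next_loaded
    return acc


coefficients = [1, 2, 3, 4]
samples = [5, 6, 7, 8]
memory = [pack(a, d) for a, d in zip(coefficients, samples)]
result = pipelined_mac(memory)
```

In this model the sum over m pairs completes in m + 2 steps, because the load, the multiplication and the addition proceed together in each step instead of occupying separate steps for each pair.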